Questions tagged [data-preprocessing]

A data-mining step in which data are cleaned to prepare them for analysis

5 votes
1 answer
132 views

I am working on a benchmark study of survival models, which is why I am working with a wide array of survival datasets. My repository has 50 survival datasets, including regular events and ...
Sultan Ahmed Sagor
0 votes
0 answers
55 views

I have a dataset with ~20,000 entries containing mean values for different groups. The groups are defined by 4 categorical columns, and I have the week number, the number of samples per week and the ...
Dee Vee
2 votes
0 answers
31 views

The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...
DriesB
  • 21
5 votes
1 answer
45 views

I'm working on a classification problem where the goal is to maximize the F1-score, hopefully above 80%. Despite a very thorough EDA and preprocessing workflow, I've hit a hard performance ceiling ...
hijunyng
0 votes
0 answers
80 views

I am currently conducting an online survey in a factorial setting ("vignette study"). I have 8 vignettes in total, varying in three dimensions (let us call them Dimension A, Dimension B and ...
trimmu
  • 11
0 votes
0 answers
49 views

I am currently working on a project where I need to assign customers to N recipes before A/B testing such that the KPIs for each customer are balanced across recipes (to reduce pre-test bias). Dataset ...
Rishab
  • 31
4 votes
1 answer
143 views

Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...
Evan
  • 329
1 vote
0 answers
39 views

I am analyzing NIR spectra from which I am trying to measure a physical property that I mostly expect to be scatter. However, my samples have a complex surface morphology and I need some ...
phil27
  • 11
1 vote
0 answers
75 views

I've recently learnt unsupervised learning methods such as KMeans and DBSCAN. While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...
ssmalik
  • 41
3 votes
2 answers
601 views

I'm applying K-Means clustering to a dataset of ship voyages. The goal is to group voyages into performance-based clusters like cost-efficient, underperforming, etc. I have 12 features in total: 10 ...
ssmalik
  • 41
3 votes
1 answer
210 views

I am working with some data in which the output target values $(Y)$ are all strictly positive values, essentially in the range of 0.001 to 100. Since these values can inherently never be negative or ...
Applesauce44
1 vote
0 answers
128 views

I've been reading into different Gaussian processes recently to better fit some data that I'm working with. My data clearly does not follow a multivariate Gaussian as required for a standard exact ...
Applesauce44
0 votes
1 answer
104 views

Can outlier removal be done on only one class in a binary classification problem? When faced with class imbalance, for example, can it be done only on the majority class? If so, is there any paper on this ...
vhd
  • 25
6 votes
0 answers
324 views

Assume we are only able to observe a two-way table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...
Three Diag
1 vote
0 answers
63 views

I have some time series data with multiple features. The output is shifted (that is, the times at which I have output values are shifted relative to the corresponding inputs, and irregularly so). I have ...
Ash Ketchump
