Newest 'data-preprocessing' Questions

5 votes

1 answer

132 views

Pre-processing of longitudinal data to convert to a regular survival analysis

I am working on a benchmark study of survival models, which is why I am working with a wide array of survival datasets. In my repository, I have 50 survival datasets including regular events and ...

Sultan Ahmed Sagor

223

asked Feb 23 at 1:33

0 votes

0 answers

55 views

Outlier detection in many short time series

I have a dataset with ~20,000 entries containing mean values for different groups. The groups are defined by 4 categorical columns, and I have the week number, the number of samples per week and the ...

Dee Vee

1

asked Nov 25, 2025 at 11:34

2 votes

0 answers

31 views

How to separate transformation/preprocessing of training and validation datasets in glmnet? [closed]

The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...

DriesB

21

asked Oct 30, 2025 at 15:21

5 votes

1 answer

45 views

Why are all my tuned models (DT, GB, SVM) plateauing at ~70% F1 after rigorous data cleaning and feature engineering?

I'm working on a classification problem where the goal is to maximize the F1-score, hopefully above 80%. Despite a very thorough EDA and preprocessing workflow, I've hit a hard performance ceiling ...

hijunyng

51

asked Oct 11, 2025 at 12:16

0 votes

0 answers

80 views

Fitting mixed effect model to factorial survey data

I am currently conducting an online survey in a factorial setting ("vignette study"). I have 8 vignettes in total, varying in three dimensions (let us call them Dimension A, Dimension B and ...

trimmu

11

asked Oct 6, 2025 at 10:31

0 votes

0 answers

49 views

Are there clustering algorithms or preprocessing strategies tailored for zero-inflated and continuous data types?

I am currently working on a project where I need to assign customers across N recipes before A/B testing such that KPIs for each customer are balanced across recipes (to reduce pre-test bias). Dataset ...

Rishab

31

asked Sep 26, 2025 at 6:09

4 votes

1 answer

143 views

When and how can unsupervised preprocessing before splitting data lead to overoptimistic model performance?

Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...

Evan

329

asked Jul 30, 2025 at 15:22

1 vote

0 answers

39 views

NIR spectra preprocessing - two point linear baseline correction -OPLS

I am analyzing NIR spectra, from which I am trying to measure a physical property that I mostly expect to be scatter. However, my samples have a complex surface morphology and I need some ...

phil27

11

asked Jun 26, 2025 at 10:57

1 vote

0 answers

75 views

"How to validate if a dataset has natural clusters?"

I've recently learnt unsupervised learning methods such as KMeans and DBSCAN. While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...

ssmalik

41

asked Jun 24, 2025 at 7:43

3 votes

2 answers

601 views

How can I apply KMeans clustering if all variables are highly uncorrelated

I'm applying K-Means clustering to a dataset of ship voyages. The goal is to group voyages into performance-based clusters like cost-efficient, underperforming, etc. I have 12 features in total: 10 ...

ssmalik

41

asked Jun 21, 2025 at 7:50

3 votes

1 answer

210 views

Large errors with log-transformed Gaussian process regression?

I am working with some data in which the output target values $(Y)$ are all strictly positive values, essentially in the range of 0.001 to 100. Since these values can inherently never be negative or ...

Applesauce44

151

asked Feb 17, 2025 at 20:01

1 vote

0 answers

128 views

Modifying Gaussian Processes and/or using transformations for dealing with positive-only output values? [closed]

I've been reading into different Gaussian processes recently to better fit some data that I'm working with. My data clearly does not follow a multivariate Gaussian as required for a standard exact ...

Applesauce44

151

asked Feb 17, 2025 at 1:09

0 votes

1 answer

104 views

Outlier Removal from only One Class in a binary classification problem

Can outlier removal be done on only one class in a binary classification problem? When facing class imbalance, for example, can it be done only on the majority class? If so, is there any paper on this ...

vhd

25

asked Feb 14, 2025 at 9:54

6 votes

0 answers

324 views

Reconstructing count table when only pairwise features are visible

Assume we are only able to observe a two-way entry table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...

Three Diag

517

asked Feb 7, 2025 at 16:29

1 vote

0 answers

63 views

Neural networks - irregular time shifts of output compared to inputs in given time series data sets

I have some time series data with multiple features. The output is shifted: the times at which I have the output values are offset from the corresponding inputs, and irregularly so. I have ...

Ash Ketchump

11

asked Jan 21, 2025 at 21:56
