Questions tagged [data-preprocessing]
The step of cleaning data in data mining to prepare it for analysis
527 questions
0 votes, 0 answers, 41 views
Outlier detection in many short time series
I have a dataset with ~20,000 entries containing mean values for different groups. The groups are defined by 4 categorical columns, and I have the week number, the number of samples per week and the ...
1 vote, 0 answers, 23 views
How to separate transformation/preprocessing of training and validation datasets in glmnet? [closed]
The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...
3 votes, 1 answer, 32 views
Why are all my tuned models (DT, GB, SVM) plateauing at ~70% F1 after rigorous data cleaning and feature engineering?
I'm working on a classification problem where the goal is to maximize the F1-score, hopefully above 80%. Despite a very thorough EDA and preprocessing workflow, I've hit a hard performance ceiling ...
0 votes, 0 answers, 77 views
Fitting mixed effect model to factorial survey data
I am currently conducting an online survey in a factorial setting ("vignette study"). I have 8 vignettes in total, varying in three dimensions (let us call them Dimension A, Dimension B and ...
0 votes, 0 answers, 44 views
Are there clustering algorithms or preprocessing strategies tailored for zero-inflated and continuous data types?
I am currently working on a project where I need to assign customers across N recipes before A/B testing such that the KPIs for each customer are balanced across recipes (to reduce pre-test bias).
Dataset ...
4 votes, 1 answer, 133 views
When and how can unsupervised preprocessing before splitting data lead to overoptimistic model performance?
Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...
1 vote, 0 answers, 36 views
NIR spectra preprocessing - two point linear baseline correction -OPLS
I am analyzing NIR spectra to measure a physical property that I expect to appear mostly as scatter.
However, my samples have a complex surface morphology and I need some ...
1 vote, 0 answers, 72 views
How to validate if a dataset has natural clusters?
I've recently learnt unsupervised learning methods such as KMeans and DBSCAN.
While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...
3 votes, 2 answers, 576 views
How can I apply KMeans clustering if all variables are highly uncorrelated?
I'm applying K-Means clustering to a dataset of ship voyages. The goal is to group voyages into performance-based clusters like cost-efficient, underperforming, etc.
I have 12 features in total:
10 ...
3 votes, 1 answer, 189 views
Large errors with log-transformed Gaussian process regression?
I am working with some data in which the output target values $(Y)$ are all strictly positive values, essentially in the range of 0.001 to 100. Since these values can inherently never be negative or ...
1 vote, 0 answers, 111 views
Modifying Gaussian Processes and/or using transformations for dealing with positive-only output values? [closed]
I've been reading into different Gaussian processes recently to better fit some data that I'm working with. My data clearly does not follow a multivariate Gaussian as required for a standard exact ...
0 votes, 1 answer, 98 views
Outlier removal from only one class in a binary classification problem
Can outlier removal be done on only one class in a binary classification problem?
When facing class imbalance, for example, can it be done only on the majority class?
If so, is there any paper on this ...
6 votes, 0 answers, 320 views
Reconstructing count table when only pairwise features are visible
Assume we can only observe a two-way table counting the number of observations of each pair of categorical features $x_i,x_j$.
$$
\begin{array}{c|ccc}
& & x_j & \\
\hline
...
1 vote, 0 answers, 60 views
Neural networks - irregular time shifts of output compared to inputs in given time series data sets
I have some time series data with multiple features. The output is shifted in time: the times at which I have output values are offset from the corresponding inputs, and the offsets are irregular. I have ...
1 vote, 0 answers, 41 views
Preprocess two different kinds of datasets for a machine learning problem
I am working on two health-related datasets, using Python.
One tabular dataset (called A) contains patient-level information (by id) and a bunch of other features which I have already transformed ...