Questions tagged [data-imputation]
Data imputation is the process of replacing missing data with substituted values. This could involve statistically representative data filling (e.g. local averages) or simply replacing the missing data with encoded values (e.g. replace NaNs with zeros).
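As a minimal sketch of the two strategies named above, assume a plain Python list in which `float("nan")` marks a missing reading. The helper names `fill_constant` and `fill_local_mean` and the window size are illustrative and do not come from any library.

```python
import math


def fill_constant(values, fill=0.0):
    """Replace every NaN with a fixed encoded value (here 0.0)."""
    return [fill if math.isnan(v) else v for v in values]


def fill_local_mean(values, window=1):
    """Replace each NaN with the mean of observed neighbours within `window` positions.

    A NaN with no observed neighbour in range is left as NaN.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        # Only the original observed values are averaged, never earlier fills.
        neighbours = [x for x in values[lo:hi] if not math.isnan(x)]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


readings = [1.0, float("nan"), 3.0, 4.0, float("nan")]
zero_filled = fill_constant(readings)    # encoded-value replacement
mean_filled = fill_local_mean(readings)  # local-average replacement
```

The constant fill keeps the gaps identifiable but distorts the distribution, while the local mean stays close to nearby observations; which trade-off is acceptable depends on the downstream model.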
133 questions
8
votes
2
answers
87
views
Best practice for handling structured missing data
I'm working with some road traffic accident data and would appreciate advice on how to handle a structured case of missing values.
For context, the features involved are:
...
4
votes
0
answers
46
views
Estimating Final Vehicle Counts from Pairwise Marginals Using Python
I am working with vehicle registration data from website. The website provides counts for various combinations of vehicle attributes such as Maker, RTO, Fuel, Category, SubCategory, and Emission.
...
0
votes
0
answers
79
views
What is the best practice for imputing missing data with patterns over time? (Potential of K-means clustering for imputing missing values?)
Years ago, I read a paper that proposed a K-means-based approach to impute missing values in energy time data. At that time, since I did not have access to that data, I tried to ...
3
votes
2
answers
132
views
How do I fill the null values of a categorical column?
I'm working on a project using an e-commerce dataset. I'm facing an issue in the data cleaning stage. I have the customers dataset, which has approximately 1.6 million rows. One of the features, "...
4
votes
1
answer
73
views
How do outliers affect the process of imputing missing data in categorical variables?
When dealing with missing data in categorical variables, common approaches include imputation by mode or predictive models. However, in some cases, certain categories have extremely low frequency or ...
0
votes
0
answers
51
views
Imputing on Temporal Data
I have a set of non-stationary data in which certain features do not have a value. If this is the case, during imputation of these features do I need to ensure that I only use previous data to generate ...
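One way to make the "only use previous data" constraint concrete is a causal fill, where each gap is filled from the most recent earlier observation and never from later ones. A minimal sketch, assuming time-ordered values with `None` marking a gap (the function name `forward_fill` is illustrative); whether such a fill suits a given non-stationary series is a separate question.

```python
def forward_fill(series):
    """Fill each None with the last observed value that precedes it.

    Leading gaps stay None because no earlier observation exists.
    """
    filled = []
    last_seen = None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


observations = [None, 2.0, None, None, 5.0, None]
causal = forward_fill(observations)
```

Because the loop only ever looks backwards, the filled value at any time step cannot leak information from the future.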
0
votes
0
answers
36
views
How to compare different ML models for imputation if I split the data into train and test datasets?
I have a full dataset and introduce some missingness of one of these types (MCAR, MAR, MNAR), then split the data into train and test datasets. After that, I impute missing values by using different ML ...
5
votes
1
answer
60
views
Looking to replace missing time series values with values from a competitor that's correlated
I have a dataset of a retailer that has the following attributes:
Date, Hour, Enters, Exits
I have another dataset with the same attributes of a competitor that is correlated with the original dataset ...
0
votes
1
answer
32
views
Would imputing using the target variable and then analysing correlation between variables be bad due to bias?
I have mortality and nutritional data for countries. The mortality data is complete for every year, but the nutritional data is very limited, with maybe 2 or 3 years of nutritional data within a 40-year period ...
0
votes
1
answer
78
views
Filling a lot of missing values with arbitrary value
I have a dataset of say 1 million observations. As a silly example, say we want to predict if a person can become a data scientist or not (0/1). I have variables that have a lot of missing values but ...
1
vote
0
answers
43
views
Filling NaNs by mode
I have data with a lot of NaNs:
...
2
votes
0
answers
41
views
Should Imputation Models be Cross Validated
I have a project where I am predicting the best schools based on a series of test scores, teacher attendance rates, etc. I would like to predict the best school to go to. Some of the data is of ...
0
votes
1
answer
84
views
Handling predictions with optional or missing features
We have a few variables that are highly predictive in our modeling task. Is it sound to train models with a superset of features even though some are known NOT to be available at predict time? & ...
2
votes
1
answer
80
views
Change of data shape when using IterativeImputer from sklearn
I am using the IterativeImputer from sklearn and I notice that it changes the data shape. Initially I have an (X,5) array where all columns except for the last one contain the missing value (which has ...
2
votes
0
answers
93
views
Best practices for handling "NA" when all NA values exist due to being below the limit of detection?
I am working in R, and have a data set which has a few metabolite concentration values as continuous variables. Anywhere that the concentration was too low to be detected it simply says <LOD. This ...
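One commonly used simple convention for left-censored values like these is to substitute a fixed fraction of the detection limit, such as LOD/2. A minimal sketch in Python rather than R, assuming the raw column arrives as strings, that the marker is exactly `<LOD`, and that the limit is known (the value 0.5 and the function name `substitute_lod` are made up for illustration). It shows the substitution only and does not claim to be the best practice the question asks about.

```python
def substitute_lod(raw_values, lod, fraction=0.5, marker="<LOD"):
    """Convert strings to floats, replacing the censoring marker with fraction * lod."""
    return [lod * fraction if v.strip() == marker else float(v) for v in raw_values]


raw = ["1.2", "<LOD", "0.8", "<LOD"]
concentrations = substitute_lod(raw, lod=0.5)  # hypothetical detection limit
```

Substitution is easy to apply but treats every censored value as identical, so it can understate variance when many values fall below the limit.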