Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
3
votes
1
answer
89
views
Advice on regression approach
How should I handle a mass-point in the dependent variable when running OLS regression in R?
I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...
5
votes
3
answers
533
views
How to handle outliers when some predictors perform better with them and others without
I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score.
I’ve ...
4
votes
5
answers
704
views
How Do Quartiles Help Us Understand a Dataset?
It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...
1
vote
0
answers
47
views
Potential CNN Overfitting Due to Limited Training Data
Neural network beginner here. I am currently implementing a CNN in PyTorch for recognizing Japanese handwritten letters, with 46 output classes.
I found a dataset on Kaggle https://www.kaggle....
1
vote
0
answers
99
views
Looking for an authentic example with extremely small coefficient of variation [closed]
Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation: $\text{CV} = \frac{s}{\bar{x}}$. To ...
1
vote
1
answer
136
views
What is the current consensus on "using test set as training set, post testing"? [duplicate]
This question is inspired by a blog post at https://www.argmin.net/p/in-defense-of-typing-monkeys and several rumors I've heard from other people who work in machine learning.
The gist of it is that ...
1
vote
2
answers
123
views
How can the standard error measure how accurately a sample represents the population, when we don’t have access to the population’s data?
If I understand correctly, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter.
Please suggest any ...
0
votes
1
answer
50
views
Theoretical question around Implicit Attitude Test data between timepoints: single vs. multiple datapoints per person?
I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...
2
votes
1
answer
102
views
Quantitatively determining unexplored parameter spaces [closed]
If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature, etc., recorded from experiments (not performed by us), are there established methods to quantitatively ...
1
vote
1
answer
94
views
A correct approach to validate/correct readings from similar sensors?
I am looking to apply a calibration/correction approach to a set of sensors, and I want to know whether the approach I am going to use is statistically correct and acceptable.
I am using a set of ...
2
votes
1
answer
83
views
Customer propensity: time based split or random split
I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...
0
votes
0
answers
39
views
Is there any standard or common notation for censored values in data files?
Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one.
Are there any standard ...
0
votes
0
answers
49
views
Question in longitudinal survey is no longer asked. MNAR?
In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...
1
vote
1
answer
95
views
Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?
The MNIST dataset can be obtained directly using Keras by running the following lines of Python code.
...
0
votes
0
answers
49
views
Handling Missing Values in the dataset
I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...