Newest 'dataset' Questions - Cross Validated

3 votes

1 answer

89 views

Advice on regression approach

How should I handle a mass-point in the dependent variable when running OLS regression in R? I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...

Jim

31

asked 13 hours ago

5 votes

3 answers

533 views

How to handle outliers when some predictors perform better with them and others without

I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score. I’ve ...

QualityX

51

asked Oct 8 at 19:23

4 votes

5 answers

704 views

How Do Quartiles Help Us Understand a Dataset?

It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...

Buchi

41

asked Oct 2 at 17:14
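
As a minimal sketch of what the quartile question is getting at: the three quartiles split sorted data into four equal-count parts, and the interquartile range (IQR) describes the spread of the middle half. The `scores` values and the `quartile_summary` helper below are made up for illustration, and the "inclusive" interpolation method is only one of several conventions.

```python
import statistics


def quartile_summary(values):
    """Return Q1, the median, Q3 and the IQR of a list of numbers."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Hypothetical data, not taken from the question.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = quartile_summary(scores)
print(summary)
```

Read this way, roughly half of the observations lie between `q1` and `q3`, and a large gap between the median and one of the quartiles hints at skew in that direction.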

1 vote

0 answers

47 views

Potential CNN Overfitting Due to Limited Training Data

Neural network beginner here. I am currently implementing a CNN in PyTorch for recognizing Japanese handwritten letters, which has 46 output classes. I found a dataset on Kaggle https://www.kaggle....

Krish Thyagarajan

11

asked Sep 7 at 16:33

1 vote

0 answers

99 views

Looking for an authentic example with extremely small coefficient of variation [closed]

Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation: $\text{CV} = \frac{s}{\bar{x}}$. To ...

Gregg H

7,077

asked Sep 1 at 14:44
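
The formula in the question above, $\text{CV} = \frac{s}{\bar{x}}$, can be sketched in a few lines. This assumes $s$ is the sample standard deviation (the $n-1$ denominator); the `coefficient_of_variation` helper and the `readings` values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        # The ratio is undefined when the mean is zero.
        raise ValueError("CV is undefined for a mean of zero")
    return statistics.stdev(values) / mean


# Hypothetical measurements, not taken from the question.
readings = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(readings)
print(round(cv, 4))
```

A very small CV means the spread is tiny relative to the typical magnitude, which is why the statistic only makes sense for ratio-scale variables with a meaningful zero.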

1 vote

1 answer

136 views

What is the current consensus on "using test set as training set, post testing"? [duplicate]

This question is inspired by a blog post (https://www.argmin.net/p/in-defense-of-typing-monkeys) and several rumors I've heard from other people who work in machine learning. The gist of it is that ...

Your neighbor Todorovich

707

asked Aug 22 at 4:12

1 vote

2 answers

123 views

How can the standard error measure how accurately a sample represents the population, when we don’t have access to the population’s data?

If I got it correct, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter. Please suggest any ...

okman

315

asked Aug 3 at 10:23
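
One way to see why no population data is needed: the usual estimate of the standard error of the mean, $\text{SE} = \frac{s}{\sqrt{n}}$, is computed from the sample alone. A minimal sketch, with a hypothetical `standard_error_of_mean` helper and made-up `sample` values:

```python
import math
import statistics


def standard_error_of_mean(values):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # stdev uses the n - 1 denominator, i.e. the sample standard deviation.
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical sample, not taken from the question.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error_of_mean(sample)
print(round(se, 4))
```

The estimate shrinks as `n` grows, which reflects that the mean of a larger sample varies less from one sample to the next.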

0 votes

1 answer

50 views

Theoretical question around Implicit Attitude Test data between timepoints: single vs. multiple datapoints per person?

I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...

Jonathan Kim

11

asked Jul 28 at 21:03

2 votes

1 answer

102 views

Quantitatively determining unexplored parameter spaces [closed]

If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature etc. recorded from experiments (not performed by us) are there established methods to quantitatively ...

Sunera Wijeratne

31

asked Jul 23 at 17:14

1 vote

1 answer

94 views

A correct approach to validate/correct readings from similar sensors?

I am looking to apply a calibration/correction approach on a set of sensors and I just wanted to know whether the approach I am going to use is statistically correct and acceptable. I am using a set of ...

Milad

157

asked Jul 14 at 11:41

2 votes

1 answer

83 views

Customer propensity: time based split or random split

I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...

remon

21

asked Jul 9 at 5:00

0 votes

0 answers

39 views

Is there any standard or common notation for censored values, in data files?

Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one. Are there any standard ...

pglpm

1,356

asked Jun 21 at 16:33

0 votes

0 answers

49 views

Question in longitudinal survey is no longer asked. MNAR?

In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...

Kevin

353

asked May 9 at 2:13

1 vote

1 answer

95 views

Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?

The MNIST dataset can be obtained directly using Keras by running the following lines of Python code. ...

user3728501

353

asked May 5 at 12:57

0 votes

0 answers

49 views

Handling Missing Values in the dataset

I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...

Anirudh

1

asked Apr 2 at 7:34
