Questions tagged [large-data]

'Large data' refers to situations where the number of observations (data points) is so large that it necessitates changes in the way the data analyst thinks about or conducts the analysis. (Not to be confused with 'high dimensionality'.)

568 questions

3 votes

0 answers

78 views

Why does a random term in a large GAMM model make the curves spiky and wiggly?

I am creating several GAMM models with similar structures to dynamically model an acoustic parameter across realizations from multiple subjects, who are included as random smooths in my models. In the ...

Jul2415

asked Nov 8, 2024 at 23:00

-1 votes

1 answer

106 views

Packages for Big Panel Data Regression in R [closed]

Is it possible to run a regression on a panel data set with 10,000 objects ($N$), each with $2,000$ observations (time series length $T$)? If so, what package in R can handle this?

Dane

asked Sep 26, 2024 at 16:47

2 votes

1 answer

615 views

Missing data imputation in longitudinal data in R

I have a very big dataset (around 10 million rows) with repeated measures of around 500 000 individuals, irregularly spaced through time. My final goal is to do IPTW and fit a weighted Cox regression ...

Tasosmav

asked Sep 17, 2024 at 13:16

3 votes

1 answer

694 views

Dealing with non-proportional hazards in a Cox model with many variables and a large dataset

Summary: I am trying to deal with non-proportional hazards in a Cox model on a large dataset. My question is whether the proportional hazards assumption really does not hold. If not, is the second model ...

Thomas

asked Jul 17, 2024 at 14:28

2 votes

0 answers

51 views

Using whole training set for choosing model

I am working on a classification problem with what I understand to be a big dataset. I first split it into a "train" dataset and a "test" one. (Actually I am convinced ...

Videgain

asked Jul 11, 2024 at 18:34

0 votes

1 answer

126 views

How to easily convert frequency data into raw data (large dataset) for t-test? [closed]

Statistics goal: Determine if the difference between two datasets is statistically significant. Dataset description: The data is available in the form of particle size (mm) v. particle count (...

Dana Tran

asked May 2, 2024 at 4:31

6 votes

1 answer

625 views

Can Wilcoxon be used in large sample with non-normal distribution?

I am doing my undergrad research, aiming to find the difference between before and after an intervention. Our sample size is 37, which is already considered a large sample, right? However, when we test ...

Chilenesa

asked Apr 24, 2024 at 3:45

5 votes

2 answers

561 views

Detecting interactions in large logistic regression models

I have a dataset of a few million observations of a binary response with a low "Success" probability of on average 1% to 2%. The dataset encompasses several categorical (~20, some with up to ...

g g

asked Apr 21, 2024 at 11:22

1 vote

0 answers

73 views

Exploratory Factor analyses on large data sets

I have a question about using EFA on a large data set of survey questions. The goal is to form an index from over 200 items, and partly also as a form of dimension reduction (I understand PCA is also ...

Ewen Tan

asked Apr 14, 2024 at 11:59

1 vote

0 answers

39 views

Trust the graphs or go with Breusch-Pagan and White's tests for Homoscedasticity on large datasets? [duplicate]

I have a large dataset (n > 500,000) which I'm building a linear model with lm(PV1READ ~ PV1MATH + PV1SCIE + ST004D01T). Tests for Normality, No ...

pluke

asked Mar 19, 2024 at 17:46

0 votes

1 answer

75 views

Decomposition of VAR(1) coefficient matrix

Consider the VAR(1) process $X_t = \Phi X_{t-1} + \epsilon_t.$ Is there a generally accepted decomposition for the coefficient matrix $\Phi$ that would decrease the degrees of freedom? My initial ...

Ville

asked Mar 19, 2024 at 11:27

2 votes

0 answers

114 views

In the mgcv::bam function in R, how can I constrain a two-dimensional smooth to be monotonically increasing in both dimensions for large data?

I have a large dataset (1.3M rows) where I want to ensure that both Age and Duration increase monotonically for each by factor level (Male, Female). Here is the setup of the model: ...

Colin

asked Mar 12, 2024 at 13:38

0 votes

1 answer

88 views

Clustering of large text datasets with unknown number of clusters

I have a list of hotel names which may or may not be correct, and with different spellings (such as '&' instead of 'and'). I want to use clustering in order to group the hotels with different ...

user480840

asked Feb 20, 2024 at 21:17

6 votes

2 answers

439 views

Using a t-test to test effect size

In my line of work, I work with large data and often run stat tests to compare differences between groups. The problem I am facing is that if I use a $t$-test to measure any difference, the result ...

baz

asked Jan 10, 2024 at 3:17

2 votes

2 answers

147 views

p>>>n problem how to navigate?

I have DNA methylation data for 32 samples. For each sample I have DNA methylation available for tens of thousands of CpG bases (i.e., C nucleotides on DNA). I also have gene expression data from which I have ...

Saad Khan

asked Jan 3, 2024 at 22:06
