Questions tagged [large-data]
'Large data' refers to situations where the number of observations (data points) is so large that it necessitates changes in the way the data analyst thinks about or conducts the analysis. (Not to be confused with 'high dimensionality'.)
568 questions
3
votes
0
answers
78
views
Why does a random term in a large GAMM model make the curves spiky and wiggly?
I am creating several GAMM models with similar structures to dynamically model an acoustic parameter across realizations from multiple subjects, who are included as random smooths in my models.
In the ...
-1
votes
1
answer
106
views
Packages for Big Panel Data Regression in R [closed]
Is it possible to run a regression on a panel data set with 10,000 objects ($N$), each with $2,000$ observations (time series length $T$)? If so, what package in R can handle this?
2
votes
1
answer
615
views
Missing data imputation in longitudinal data in R
I have a very big dataset (around 10 million rows) with repeated measures of around 500,000 individuals, irregularly spaced through time. My final goal is to do IPTW and fit a weighted Cox regression ...
3
votes
1
answer
694
views
Dealing with non-proportional hazards in a Cox model with many variables and a large dataset
Summary: I am trying to deal with non-proportional hazards in a Cox model on a large dataset. My question is whether the proportional hazards assumption really does not hold. If not, is the second model ...
2
votes
0
answers
51
views
Using whole training set for choosing model
I am working on a classification problem with what I understand to be a big dataset. I first split it into a "train" dataset and a "test" one. (Actually I am convinced ...
0
votes
1
answer
126
views
How to easily convert frequency data into raw data (large dataset) for t-test? [closed]
Statistics goal: Determine if the difference between two datasets is statistically significant.
Dataset description: The data is available in the form of particle size (mm) vs. particle count (...
6
votes
1
answer
625
views
Can Wilcoxon be used in large sample with non-normal distribution?
I am doing my undergrad research, aiming to find the difference between before and after an intervention.
Our sample size is 37, which is already considered a large sample, right?
However, when we test ...
5
votes
2
answers
561
views
Detecting interactions in large logistic regression models
I have a dataset of a few million observations of a binary response with a low "Success" probability of 1% to 2% on average. The dataset encompasses several categorical (~20, some with up to ...
1
vote
0
answers
73
views
Exploratory Factor analyses on large data sets
I have a question about using EFA on a large data set of survey questions. The goal is to form an index from over 200 items, and partly also as a form of dimension reduction (I understand PCA is also ...
1
vote
0
answers
39
views
Trust the graphs or go with Breusch-Pagan and White's tests for Homoscedasticity on large datasets? [duplicate]
I have a large dataset (n > 500,000) for which I'm building a linear model with lm(PV1READ ~ PV1MATH + PV1SCIE + ST004D01T). Tests for Normality, No ...
0
votes
1
answer
75
views
Decomposition of VAR(1) coefficient matrix
Consider the VAR(1) process $X_t = \Phi X_{t-1} + \epsilon_t.$
Is there a generally accepted decomposition for the coefficient matrix $\Phi$ that would decrease the degrees of freedom?
My initial ...
2
votes
0
answers
114
views
In the mgcv::bam function in R, how can I constrain a two dimensional smooth to be monotonically increasing in both dimensions for large data?
I have a large dataset (1.3M rows) where I want to ensure that both Age and Duration increase monotonically for each by factor level (Male, Female). Here is the setup of the model:
...
0
votes
1
answer
88
views
Clustering of large text datasets with unknown number of clusters
I have a list of hotel names which may or may not be correct, and with different spellings (such as '&' instead of 'and'). I want to use clustering in order to group the hotels with different ...
6
votes
2
answers
439
views
Using a t-test to test effect size
In my line of work, I work with large data and often run stat tests to compare differences between groups. The problem I am facing is that if I use a $t$-test to measure any difference, the result ...
2
votes
2
answers
147
views
p>>>n problem how to navigate?
I have DNA methylation data for 32 samples. For each sample I have DNA methylation available for >10,000 CpG bases (i.e. C nucleotides on DNA). I also have gene expression data from which I have ...