Questions tagged [modeling]
This tag describes the process of creating a statistical or machine learning model. Always add a more specific tag.
335 questions
302 votes, 3 answers, 37k views
How to know that your machine learning problem is hopeless?
Imagine a standard machine-learning scenario:
You are confronted with a large multivariate dataset and you have a
pretty blurry understanding of it. What you need to do is to make
predictions ...
138 votes, 18 answers, 124k views
Including the interaction but not the main effects in a model
Is it ever valid to include a two-way interaction in a model without including the main effects? If your hypothesis is only about the interaction, do you still need to include the main effects?
296 votes, 13 answers, 275k views
Is there any reason to prefer the AIC or BIC over the other?
The AIC and BIC are both methods of assessing model fit penalized for the number of estimated parameters. As I understand it, BIC penalizes models more for free parameters than does AIC. Beyond a ...
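For reference, the two penalties mentioned in this excerpt can be made concrete. The sketch below uses the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the sample size, and ln(L) the maximized log-likelihood. The function names and the idea of a separate penalty_gap helper are illustrative assumptions, not something taken from the question.

```python
import math


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


def penalty_gap(k: int, n: int) -> float:
    """BIC penalty minus AIC penalty for k parameters and n observations.

    Positive values mean BIC penalizes the model more heavily than AIC.
    """
    return k * math.log(n) - 2 * k
```

Under these definitions the BIC penalty exceeds the AIC penalty whenever ln(n) > 2, that is, for any sample of 8 or more observations, and the gap grows with both k and n.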
137 votes, 5 answers, 144k views
Using k-fold cross-validation for time-series model selection
Question:
I want to be sure of something: is using k-fold cross-validation with time series straightforward, or does one need to pay special attention before using it?
Background:
I'm ...
80 votes, 7 answers, 56k views
Do all interaction terms need their individual terms in a regression model?
I am actually reviewing a manuscript where the authors compare 5-6 logit regression models with AIC. However, some of the models have interaction terms without including the individual covariate terms....
91 votes, 14 answers, 73k views
What is the meaning of "All models are wrong, but some are useful"?
"Essentially, all models are wrong, but some are useful."
--- Box, George E. P.; Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley. ISBN 0471810339.
What ...
83 votes, 4 answers, 29k views
Why does including latitude and longitude in a GAM account for spatial autocorrelation?
I have produced generalized additive models for deforestation. To account for spatial autocorrelation, I have included latitude and longitude as a smoothed interaction term (i.e. s(x,y)).
I've based ...
63 votes, 4 answers, 101k views
How does linear regression use the normal distribution?
In linear regression, each predicted value is assumed to have been picked from a normal distribution of possible values. See below.
But why is each predicted value assumed to have come from a normal ...
6 votes, 1 answer, 3k views
Separate Models vs Flags in the same model
I have customer data from 2 brands. The data structure is the same, but I expect customer behaviour to differ between the brands.
So I could train 2 models, 1 for each brand, or I could ...
70 votes, 3 answers, 36k views
Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one?
In what circumstances would you want to, or not want to, scale or standardize a variable prior to model fitting? And what are the advantages and disadvantages of scaling a variable?
35 votes, 3 answers, 13k views
Why is variable selection necessary?
Common data-based variable selection procedures (for example, forward, backward, stepwise, all subsets) tend to yield models with undesirable properties, including:
Coefficients biased away from zero.
...
18 votes, 1 answer, 5k views
Ratios in Regression, aka Questions on Kronmal
Recently, randomly browsing questions triggered a memory of an off-hand comment from one of my professors a few years back, warning about the use of ratios in regression models. So I started reading ...
84 votes, 6 answers, 13k views
Variable selection for predictive modeling really needed in 2016?
This question was asked on CV some years ago; it seems worth a repost in light of 1) order-of-magnitude better computing technology (e.g. parallel computing, HPC, etc.) and 2) newer techniques, e.g. [...
73 votes, 7 answers, 112k views
What is a "saturated" model?
What is meant when we say we have a saturated model?
16 votes, 2 answers, 4k views
What distribution to use to model time before a train arrives?
I'm trying to model some data on train arrival times. I'd like to use a distribution that captures "the longer I wait, the more likely the train is going to show up". It seems like such a distribution ...