3
$\begingroup$

I'm fitting a regression model with several continuous predictors, and I suspect some of them may have nonlinear effects (yeah, yeah, Box-Tidwell may answer that).

Is it reasonable to use AIC to choose, for each predictor, whether to model it linearly or with a spline, and if using a spline, how many degrees of freedom to use? Then use those choices in the final model (i.e., try to identify the combination of spline terms across the continuous predictors that yields the lowest overall AIC)? Or is that not a good approach?

More generally, how are spline terms usually chosen in practice in a multivariable regression setting?

$\endgroup$
7
  • 1
    $\begingroup$ What do you see as special about splines that distinguishes them in this respect from any other kinds of explanatory variables in regression? $\endgroup$ Commented Mar 12 at 17:49
  • 3
    $\begingroup$ Is your objective hypothesis testing or prediction? If you feed the final (AIC-chosen) model into the standard NHST machinery, note that p values will be biased low. Ideally you would determine the model based on one data set and assess significance on a different one. (I know, easier said than done.) And just as @whuber writes, this holds for splines just like for other predictors. $\endgroup$ Commented Mar 12 at 20:11
  • 2
    $\begingroup$ In all linear models, predictors are vector spaces. That's what splines are, too. This makes them statistically and mathematically the same as any other predictor in any linear model. You can see this by examining the model matrices your software creates: the columns of these matrices generate the subspaces. Indeed, the very fact that your software operates by creating a model matrix demonstrates my claim. $\endgroup$ Commented Mar 12 at 21:31
  • 1
    $\begingroup$ "if using a spline, how many degrees of freedom to use?" If you use a mgcv::gam smoother instead, the penalization takes care of this. $\endgroup$ Commented Mar 13 at 6:27
  • 2
    $\begingroup$ My answer assumed that you were asking about restricted cubic regression splines, but I particularly like Gavin Simpson's answer with its extension to the broader class of generalized additive models (GAM). Based on questions I see on this site, I suspect that the other types of GAM smoothers/splines he discusses might be more "usually chosen in practice" these days. I'd recommend that you accept his answer as a better guide for future visitors to this page. See this page for an introduction to different types of splines. $\endgroup$ Commented Mar 13 at 13:05

3 Answers

8
$\begingroup$

Frank Harrell's Regression Modeling Strategies (RMS) covers this. Chapter 2 (esp. Section 2.4) discusses splines, and Chapter 4 includes splines in a general strategy for modeling.

A good approach is to determine how many degrees of freedom you can afford to use for model parameters, based on your sample size and the type of regression (OLS, binomial, survival...); see Section 4.4 of RMS. Next decide on how many degrees of freedom to allow for each predictor. For example, if you are limited in the available degrees of freedom you might choose to spend fewer of them on a continuous predictor you are trying to "control for" than on a predictor of primary interest.

See Section 2.4.6 of RMS for guidance on how many knots (typically only 3, 4, or 5, depending on the number of cases and the flexibility you want) and where to place them ("Place knots where data exist — fixed quantiles of predictor’s marginal distribution"). As noted there some do recommend using AIC to choose the number of knots, but pre-specifying the knot number based on the number of observations and your desired flexibility would lead to fewer complaints about p-value bias from building a model based on the results of prior modeling.
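As a sketch of this approach (not code from RMS itself; the variable names `y`, `x1`, `x2`, `sex`, and the data frame `d` are placeholders), Harrell's rms package lets you pre-specify the number of restricted cubic spline knots per predictor, with knots placed at default quantiles of each predictor's marginal distribution:

```r
# Hypothetical sketch: pre-specified restricted cubic splines with rms.
library(rms)

dd <- datadist(d); options(datadist = "dd")

# Spend 4 knots (3 df) on the predictor of primary interest,
# but only 3 knots (2 df) on a predictor we merely control for.
fit <- ols(y ~ rcs(x1, 4) + rcs(x2, 3) + sex, data = d)

anova(fit)  # tests of overall and nonlinear contributions per predictor
```

Because the knot counts are fixed in advance rather than chosen by comparing fitted models, the usual inferential machinery applies without the p-value bias that data-driven selection introduces.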

If you include interactions involving predictors modeled with splines you can quickly increase the number of parameters that you need to estimate. Sometimes it then makes sense to restrict such interactions to linear terms.
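If I recall the rms conventions correctly, the `%ia%` operator implements exactly this kind of restricted interaction, omitting the doubly nonlinear product terms (again, variable names here are placeholders):

```r
# Hypothetical sketch: restricted interaction between two spline terms.
# x1 %ia% x2 keeps the interaction but drops doubly nonlinear terms,
# saving degrees of freedom versus a full spline-by-spline interaction.
fit <- ols(y ~ rcs(x1, 4) + rcs(x2, 4) + x1 %ia% x2, data = d)
```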

$\endgroup$
7
$\begingroup$

Generalized additive models (GAMs) with automatic smoothness selection take some of the guesswork out of specifying exactly how many degrees of freedom you expect $f(x)$ to use.

The idea is that the user specifies that the smooth function $f(x)$ should be represented in the model with $K$ basis functions. $K$ represents the upper limit on expected wiggliness of the true (but unknown) function. Then a wiggliness penalty is used to balance the fit of the data with the complexity of the function. A smoothing parameter, $\lambda$, is introduced as a hyperparameter that is optimised during model fitting, wherein we find estimates of the model coefficients (for any parametric terms plus the coefficients for the basis functions of the spline) that maximise the penalised log likelihood

$$ \mathcal{L}_p(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) - \frac{1}{2} \lambda \boldsymbol{\beta}^{\mathsf{T}}\mathbf{S}\boldsymbol{\beta} $$

where $\mathcal{L}$ is the log-likelihood of the data and $\mathcal{L}_p$ the penalised log-likelihood given $\boldsymbol{\beta}$, $\boldsymbol{\beta}^{\mathsf{T}}\mathbf{S}\boldsymbol{\beta}$ is the wiggliness penalty, and $\lambda$ is the smoothing parameter that controls the fit-to-data vs complexity tradeoff.

By default, the wiggliness penalty $\boldsymbol{\beta}^{\mathsf{T}}\mathbf{S}\boldsymbol{\beta}$ is a measure of the squared second derivative of the function over the range of $x$.

We would typically choose $\lambda$ with REML or ML smoothness selection, but generalised cross validation (GCV) or AIC are other options, among many criteria available depending on the specific software being used.

The point of all this is that, to some extent, you can relax a little about getting the right degrees of freedom for the spline term(s). What is important here (and for splines in general) is that the basis for $f(x)$ is rich enough to contain the true function or a close approximation to it. The general advice is to set $K$, the number of basis functions for the smooth $f(x)$, a little larger than the expected wiggliness of the true but unknown function: say you think $f$ uses ~5 degrees of freedom; there are more functions of ~5 EDF contained in a basis of size 7 or 8 than in one of size 5. Hence, one might set $K = 8$ and then allow the wiggliness penalty to shrink away some of the unneeded wiggliness implied by 8 basis functions.
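In mgcv this amounts to something like the following sketch (the response `y`, covariate `x`, and data frame `d` are placeholders):

```r
# Sketch: set the basis size k a little above the wiggliness you expect,
# then let REML smoothness selection shrink away what isn't needed.
library(mgcv)

fit <- gam(y ~ s(x, k = 8), data = d, method = "REML")

summary(fit)  # the reported edf will typically come out well below k - 1
```

The estimated degrees of freedom (EDF) reported for the smooth reflect what the penalty left in, not the upper limit $K$ you supplied.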

Of course, like anything in statistics, there is no free lunch; automatic smoothness selection methods are not infallible, and care needs to be taken when specifying $K$. If there is unmodelled autocorrelation in the response, or other clustering, things can go wrong (so tell the model about these features of the data(!) or use something like neighbourhood cross validation (NCV) to do the smoothness selection). Likewise, if you know the function has to be monotonic but don't enforce this in the model, $\hat{f}(x)$ may be too wiggly: again, a solution is to tell the model about this and enforce shape constraints on the $\boldsymbol{\beta}$.

Likewise, avoid setting $K$ really large and hoping the smoothness selection will fix the problem so that you don't actually have to think. All this will do is

  1. make your model fitting take much longer because there are many $\boldsymbol{\beta}$ to keep track of, and
  2. increase the risk that the model adapts to the specific sample of data you have observed, reducing its effectiveness as a model for unseen data and suggesting the relationship between $y$ and $x$ is more complex than it truly is.

There are other modelling choices you often need to make with splines, like choosing where to place the knots. This can be avoided if you use a low-rank eigen basis such as the low-rank thin plate regression spline basis of Wood (2003), the default in the mgcv package for R, a popular, performant, and extensive suite of functions for fitting GAMs and related models.

Using AIC in the manner you mention is going to get problematic if you want to use the model for more than prediction: if you are interested in the p values of the estimated smooth(s) or effects, this form of model selection is going to make a complete mess of things. Simon Wood and others have developed theory that allows computing p values for smooths subject to automatic smoothness selection that have approximately the right type I error rate, so the automatic selection I describe above can still yield p values you can use, if you really must use p values. Not so for the kind of AIC selection you mention.

A further problem is that dealing with each covariate in turn is going to be suboptimal in the sense that reducing the degrees of freedom for $x_2$ might mean $x_1$ should now use more degrees of freedom than you initially worked out when you looked at $x_1$ without fiddling with $x_2$. Automatic smoothness selection methods avoid this problem.

In summary, in a GAM, the first choice is to decide whether the relationship between a covariate and the response is potentially non-linear. If so, represent that relationship as a spline, and choose the number of basis functions, $K$, to use for that spline, using the points I mention above. Then fit the model using REML or ML smoothness selection (or NCV if $y$ is autocorrelated). Then do the usual model checks, plus check that your choice(s) for $K$ were OK; there are heuristic tests for this, e.g. Pya & Wood (2016), as implemented in mgcv::k.check() for example, and other checks are possible too, such as refitting the model to the residuals in place of $y$ with a larger $K$.
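That workflow can be sketched as follows (names `y`, `x1`, `x2`, and `d` are placeholders; the smooths are selected jointly rather than one covariate at a time):

```r
# Sketch: fit with REML smoothness selection, then check the basis sizes.
library(mgcv)

fit <- gam(y ~ s(x1, k = 8) + s(x2, k = 8), data = d, method = "REML")

# k.check() compares each smooth's edf to its maximum (k') and reports a
# heuristic p-value; edf near k' with a low p-value suggests increasing k
# for that smooth and refitting.
k.check(fit)

# gam.check(fit) runs the same test alongside residual diagnostics.
```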

I somewhat disagree with Peter Flom's comment

Yes, they [splines] … can be hard to interpret

While it is true that the effect of $x$ in $f(x)$ can't be summarised in a single coefficient, the component functions of a GAM can be plotted, so we can visualise exactly how the response changes as a function of $x$ while holding all other effects constant.

References

Pya, N., Wood, S.N., 2016. A note on basis dimension selection in generalized additive modelling. arXiv [stat.ME].

Wood, S.N., 2003. Thin plate regression splines. J. R. Stat. Soc. Series B Stat. Methodol. 65, 95–114. https://doi.org/10.1111/1467-9868.00374

$\endgroup$
3
$\begingroup$

Is it reasonable to use AIC? Yes. Many people do.

Is it ideal? No, I don't think so. I think that, if you believe there are nonlinearities, using splines of some sort is a good baseline. Yes, they do use more df and can be hard to interpret, but using them when the relationship is actually close to linear doesn't obscure anything or violate any rules.

$\endgroup$
