How to handle calendar year as a continuous predictor with a mismatched train/test time horizon?

Question

I am using ordinal semiparametric regression (Frank Harrell's rms package) to model overall survival in patients with brain tumors.

My training data is from the SEER database (covering years 2004 to 2022), with 89k datapoints. My test set is an external validation cohort from my local institution, which covers a wider time span (1992-2022), with ~2900 datapoints.

One of my independent variables is Diagnosis Year (in whole calendar years), which has interactions with other independent variables (e.g., treatment variables). I plan to model it flexibly with restricted cubic splines, rcs(), placing the knots at years where there were major shifts in treatment protocols.

I am trying to determine the most statistically rigorous way to handle the Diagnosis Year variable, particularly given the interaction terms and the mismatched time horizons. I have considered a few approaches:

1. Leave years as raw values: I suspect this is poor practice because the intercept would represent the baseline hazard at year 0.
2. Shift to zero (e.g., 2004 = 0): Setting the earliest year in the training set to 0. The test set's earlier years (1992-2003) would take on negative values.
3. Center according to the training set: I lean toward this approach.
4. Standardize: I prefer to avoid this because stating "a 1 standard deviation increase in diagnosis year" complicates clinical interpretation.

My questions:

1. Is centering (Approach 3) the mathematically preferred method here to maintain interpretable main effects in the presence of interactions?
2. How concerning is the backward extrapolation required for my test set?

How much more "interpretable" do you find centered data to be compared to any of the other options? Mathematically (and statistically) those options are all equivalent. Setting an origin somewhere around 2000 is a slight improvement numerically when computing in floating point arithmetic, but it should make no discernible difference here.
– whuber ♦, Commented 9 hours ago
You are right that the models are mathematically equivalent in terms of fit and predictions. My concern regarding interpretability is specifically about the main effects in the presence of my interaction terms. If I leave the year as raw values, the main effect coefficient for each treatment variable would represent the treatment effect in year 0 (which is arbitrary). By centering on the median training year, the main effect coefficient becomes the treatment effect at that specific, clinically relevant time point (as recommended in Chapter 3 of Aiken and West's 1991 book).
– Çağan Kaplan, Commented 8 hours ago

EdM · Accepted Answer · 2026-03-31 19:45:42Z

Is centering ... the mathematically preferred method here to maintain interpretable main effects in the presence of interactions?

Frank Harrell has said

I almost never use centering, finding it completely unnecessary and confusing.

Centering doesn't mathematically change the fundamental model. In some cases it does improve numerical stability. For example, the coxph() function in the R survival package silently centers and scales data internally for that purpose, but it reports coefficients on the location and scale in which the data were supplied.

Yes, an intercept value can be easier to interpret directly if you center a continuous predictor whose typical values are far from 0. With centering, "main effect" coefficients in a model with interactions can also be closer to what you would have in a realistic scenario. There's no need, however, to focus on reported coefficient estimates when there are so many helpful tools, including those in the rms package, to provide model predictions for any scenario of practical interest based on those estimates.
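As a minimal numeric sketch of this point (written in Python rather than R, and not using rms itself), the code below takes hypothetical coefficients for a simplified linear predictor with a linear year term and a treatment-by-year interaction, re-expresses them for a centered year, and computes the treatment contrast at any year of interest. The coefficient values, the centering year, and the function names are all made up for illustration. The spline model in the question has more terms, but the same logic applies.

```python
# Simplified linear predictor (an assumption; the real model uses splines):
#   lp = b0 + b_trt*T + b_year*year + b_int*T*year
# Hypothetical raw-scale coefficients, chosen only for illustration.
RAW = {"b0": 40.0, "b_trt": 59.7, "b_year": -0.02, "b_int": -0.03}
CENTER = 2013  # hypothetical centering year


def center_coefs(raw, c):
    """Re-express raw-scale coefficients for the predictor (year - c).

    Substituting year = c + (year - c) into the linear predictor moves
    b_year*c into the intercept and b_int*c into the treatment term;
    the two slopes are unchanged.
    """
    return {
        "b0": raw["b0"] + raw["b_year"] * c,
        "b_trt": raw["b_trt"] + raw["b_int"] * c,
        "b_year": raw["b_year"],
        "b_int": raw["b_int"],
    }


def linear_predictor(coefs, treated, year_value):
    """Linear predictor for treated in {0, 1} at the given year value."""
    return (coefs["b0"] + coefs["b_trt"] * treated
            + coefs["b_year"] * year_value
            + coefs["b_int"] * treated * year_value)


def treatment_contrast(coefs, year_value):
    """Treated minus untreated, at a fixed year value."""
    return (linear_predictor(coefs, 1, year_value)
            - linear_predictor(coefs, 0, year_value))


CENTERED = center_coefs(RAW, CENTER)

if __name__ == "__main__":
    for year in (2004, 2013, 2022):
        raw_effect = treatment_contrast(RAW, year)
        centered_effect = treatment_contrast(CENTERED, year - CENTER)
        print(year, round(raw_effect, 6), round(centered_effect, 6))
```

Under these assumptions the raw-scale "main effect" coefficient is the treatment contrast at year 0 and the centered one is the contrast at 2013, yet the contrast at any given year is identical in both parameterizations. Contrasts and predictions at chosen years, which is what rms functions such as contrast() and Predict() report, therefore do not depend on the choice of origin.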

How concerning is the backward extrapolation required for my test set?

Very, for a couple of reasons.

First, if you are using restricted cubic splines to model Year, your backward extrapolation will be based on a linear slope that matches the slope, at the first knot, of the first cubic segment of the spline. There's no assurance that extrapolation would work well before the 2004 start time of the SEER data, even if the fundamental clinical situation was the same as for the SEER data.
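The linear tail can be checked with a small sketch (Python, not rms code). It builds the truncated-power form of the restricted cubic spline basis described in Harrell's Regression Modeling Strategies, without any rescaling that software may apply, and evaluates a curve with hypothetical knots and coefficients. The knot years, coefficient values, and function names are assumptions for illustration only.

```python
def rcs_basis(x, knots):
    """Linear term plus k-2 nonlinear restricted cubic spline terms.

    Unnormalized truncated-power form; each nonlinear term is built so
    that its cubic and quadratic parts cancel beyond the last knot.
    """
    t_last, t_prev = knots[-1], knots[-2]
    gap = t_last - t_prev

    def pos3(u):
        # Truncated cube: zero for u <= 0, u**3 otherwise.
        return max(u, 0.0) ** 3

    terms = [float(x)]
    for t_j in knots[:-2]:
        terms.append(
            pos3(x - t_j)
            - pos3(x - t_prev) * (t_last - t_j) / gap
            + pos3(x - t_last) * (t_prev - t_j) / gap
        )
    return terms


# Hypothetical knots (years) and coefficients, for illustration only.
KNOTS = [2005, 2010, 2015, 2020]
COEFS = [0.05, -0.002, 0.004]  # linear term, then two nonlinear terms


def curve(year):
    """Spline contribution to the linear predictor at the given year."""
    return sum(b * v for b, v in zip(COEFS, rcs_basis(year, KNOTS)))


def second_difference(year):
    """Discrete curvature; zero wherever the curve is a straight line."""
    return curve(year - 1) - 2 * curve(year) + curve(year + 1)


if __name__ == "__main__":
    for year in (1995, 2000, 2012, 2025):
        print(year, round(second_difference(year), 9))
```

In this sketch the curvature is zero for every year before the first knot, so predictions for 1992 to 2003 are a straight-line projection whose slope is the linear coefficient alone. Nothing in the training data can confirm or contradict that projection.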

Second, I suspect that clinical practice and outcomes for brain tumors could have improved in the dozen years between the 1992 start of your test data and the 2004 start of the SEER training data. In that case, even if linear extrapolation from your SEER data model to earlier times properly handled prior cases with those later clinical practices and outcomes, the model itself would not be applicable to earlier cases that didn't have the advantage of those later improvements.

I'd recommend limiting your test data to the time period that matches the SEER data.

Thank you for your answer. I am content with using raw years if that is the accepted approach, given the rms tools. However, I still do not completely understand why Aiken and West recommend centering the continuous variable when there is an interaction (Chapter 3 of their 1991 book), even though the centered and uncentered models are mathematically equivalent. Regarding the earlier years in the test set: if we set them to the last year available in the training set, would this let us keep the earlier datapoints by bypassing the backward extrapolation? Or would it introduce bias?
– Çağan Kaplan, Commented 8 hours ago
@ÇağanKaplan Without centering, calculations for interaction (product) terms can involve fairly extreme numeric values, which was perhaps more of a practical issue in 1991 than it is today. It would be dangerous to use your pre-2004 test data to evaluate your SEER model. Setting the dates of the earlier test cases to 2004 (as you seem to propose) would assume that the clinical and outcome situation for those earlier cases was the same as in 2004. That seems unlikely, as IMRT started being widely adopted in the late 1990s.
– EdM, Commented 7 hours ago
