I am using ordinal semiparametric regression (Frank Harrell's rms package) to model overall survival in patients with brain tumors.
My training data is from the SEER database (covering years 2004 to 2022), with 89k datapoints. My test set is an external validation cohort from my local institution, which covers a wider time span (1992-2022), with ~2900 datapoints.
One of my independent variables is Diagnosis Year (in whole calendar years), which has interactions with other independent variables (e.g., treatment variables). I am planning to model it flexibly using restricted cubic splines via rcs(), with the knots placed at years where there have been major shifts in treatment protocols.
I am trying to determine the most statistically rigorous way to handle the Diagnosis Year variable, particularly given the interaction terms and the mismatched time horizons. I have considered a few approaches:
1. Leave years as raw values: I suspect this is poor practice because the intercept would represent the baseline hazard at year 0, which is far outside the data.
2. Shift to zero (e.g., 2004 = 0): Set the earliest year in the training set to 0. The test set's earlier years (1992-2003) would then take on negative values.
3. Center according to the training set: Subtract the training-set mean year from Diagnosis Year in both the training and test sets. I lean toward this approach.
4. Standardize: I prefer to avoid this because stating "a 1 standard deviation increase in diagnosis year" complicates clinical interpretation.
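To make the comparison between the raw and centered options concrete, here is a minimal sketch (in Python rather than R, so it runs without rms) of a restricted cubic spline basis written from the formula as I understand it, with the nonlinear terms scaled by the squared distance between the outer knots. The knot years and the centering constant are hypothetical placeholders, not my actual values. The sketch only checks how the basis columns change when the year and the knots are shifted by the same constant.

```python
def pos_cube(u):
    """Truncated cubic (u)_+^3."""
    return u ** 3 if u > 0 else 0


def rcs_basis(x, knots):
    """Linear term plus k-2 nonlinear restricted cubic spline terms for one value x."""
    t = sorted(knots)
    span = t[-1] - t[-2]            # distance between the last two knots
    scale = (t[-1] - t[0]) ** 2     # assumed normalization: squared outer-knot distance
    row = [x]
    for tj in t[:-2]:
        term = (pos_cube(x - tj)
                - pos_cube(x - t[-2]) * (t[-1] - tj) / span
                + pos_cube(x - t[-1]) * (t[-2] - tj) / span)
        row.append(term / scale)
    return row


KNOTS = [2005, 2009, 2014, 2019]  # hypothetical protocol-change years
CENTER = 2013                     # hypothetical training-set mean year

years = list(range(1992, 2023))
raw = [rcs_basis(y, KNOTS) for y in years]
centered = [rcs_basis(y - CENTER, [t - CENTER for t in KNOTS]) for y in years]

# Largest difference between raw and centered nonlinear columns.
max_nonlinear_diff = max(
    abs(a - b) for r, c in zip(raw, centered) for a, b in zip(r[1:], c[1:])
)
# Set of differences between raw and centered linear columns.
linear_shifts = {r[0] - c[0] for r, c in zip(raw, centered)}

print(max_nonlinear_diff, linear_shifts)
```

If this sketch is right, the nonlinear columns depend only on differences between the year and the knots, so shifting or centering leaves them unchanged and moves only the linear column by a constant. That would suggest approaches 1 to 3 give the same fitted values in an unpenalized model and differ only in which year the treatment main effects refer to, but I would appreciate confirmation that this carries over to the interaction terms as rms constructs them.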
My questions:
1. Is centering (Approach 3) the mathematically preferred method here to maintain interpretable main effects in the presence of interactions?
2. How concerning is the backward extrapolation required for my test set?
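To show what the backward extrapolation looks like under a restricted cubic spline, here is a small self-contained Python sketch using the same basis formula as I understand it. The knot years are hypothetical placeholders. It evaluates the nonlinear spline terms for the test-set years that precede the training data (1992-2003) and checks whether any of them are nonzero.

```python
def pos_cube(u):
    """Truncated cubic (u)_+^3."""
    return u ** 3 if u > 0 else 0


def rcs_nonlinear(x, knots):
    """The k-2 nonlinear restricted cubic spline terms for one value x."""
    t = sorted(knots)
    span = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # assumed normalization: squared outer-knot distance
    return [
        (pos_cube(x - tj)
         - pos_cube(x - t[-2]) * (t[-1] - tj) / span
         + pos_cube(x - t[-1]) * (t[-2] - tj) / span) / scale
        for tj in t[:-2]
    ]


KNOTS = [2005, 2009, 2014, 2019]          # hypothetical protocol-change years
pre_seer_years = list(range(1992, 2004))  # test-set years before the training data

tail = [rcs_nonlinear(y, KNOTS) for y in pre_seer_years]
# True when every nonlinear term vanishes below the first knot.
all_zero = all(v == 0 for row in tail for v in row)

print(len(pre_seer_years), all_zero)
```

Below the first knot every truncated cubic is zero, so in this sketch the year effect for 1992-2003 comes entirely from the linear term. In other words, predictions for those 12 years would be a straight-line continuation of the slope estimated near the start of the training period, which is the part I am unsure is defensible.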