I came across the article by Bender et al. (2005) and tried to implement it in R to simulate survival times based on an empirical baseline hazard estimated from existing data.
I compute survival times based on equation (6), which transforms "uniformly distributed random numbers [...] into survival times following a specific Cox model [where it] is just required to insert the inverse of an appropriate cumulative baseline hazard function into equation (6)" (Bender et al., 2005, p. 1716).
Said equation is: $T = H_0^{-1}\left[-\log(U)\exp(-\beta' x)\right]$
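For reference, this is just inverse-transform sampling applied to the Cox survival function $S(t \mid x) = \exp\{-H_0(t)\exp(\beta' x)\}$: setting $U = S(T \mid x)$ with $U \sim \mathrm{Uniform}(0,1)$ and solving for $T$ gives

$$
U = \exp\bigl(-H_0(T)\exp(\beta' x)\bigr)
\;\Longrightarrow\;
H_0(T) = -\log(U)\exp(-\beta' x)
\;\Longrightarrow\;
T = H_0^{-1}\bigl[-\log(U)\exp(-\beta' x)\bigr].
$$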
Considering a random effect (re) and one predictor (x), I coded:
# calculate the linear predictor for each observation
linpred <- beta * x + re
# simulate event times using the inverse transform sampling method
u <- runif(n_total)
target_h <- -log(u) / exp(linpred)
# Interpolate: find t such that H(t) = target_h
h_to_t <- approxfun(
x = cum_baseline_hazard$hazard,
y = cum_baseline_hazard$time,
method = "linear",
rule = 2, # clamp to range
ties = "mean"
)
# generate survival times
T <- h_to_t(target_h)
where the cum_baseline_hazard object is computed from real data.
This is considerably faster than using the simsurv() function from the package of the same name. For a scenario of 100 clusters, 10 observations per cluster, a fixed effect of 0.2, and a random-effect standard deviation of 0.5, my way of computing the survival times took 0.002 seconds and recovered the true effect quite well (estimated effect = 0.2061), whereas simsurv() took 13.76 seconds (also with a well-recovered estimate of 0.212).
Is my translation of the equation correct?
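As a quick, self-contained sanity check of the inversion logic, I can use a made-up exponential baseline $H_0(t) = \lambda t$, whose inverse is known in closed form (so this does not touch the empirical baseline; lambda, the grid, and the linear predictor are arbitrary choices for illustration):

```r
# exponential baseline H0(t) = lambda * t on a fine grid
lambda <- 0.5
grid <- seq(0, 20, by = 0.01)
cum_h_exp <- data.frame(time = grid, hazard = lambda * grid)

# same interpolation-based inverse as in my code below
h_to_t_check <- approxfun(
  x = cum_h_exp$hazard,
  y = cum_h_exp$time,
  method = "linear",
  rule = 2
)

set.seed(1)
u <- runif(1000)
linpred <- 0.3 # arbitrary fixed linear predictor
target_h <- -log(u) / exp(linpred)

t_interp <- h_to_t_check(target_h)
t_exact <- target_h / lambda # closed-form inverse of H0

max(abs(t_interp - t_exact)) # should be ~0 (linear interpolation is exact here)
```

Because $H_0$ is linear, linear interpolation recovers the analytic inverse up to floating-point error, which at least confirms the mechanics of the transformation.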
For a replication of those results and a full example (using simulated stand-in data rather than real data), see my code below.
library(tidyverse)
library(survival)
library(simsurv)
set.seed(42)
n_cluster <- 100
n_obs_per_cluster <- 10
n_total <- n_cluster * n_obs_per_cluster
fixed_effect <- 0.2
random_effect_sd <- 0.5
max_time <- 10
# assume this would be real data
true_data <- tibble(
id = rep(1:n_cluster, each = n_obs_per_cluster),
x = rnorm(n_total)
) |>
mutate(
eventtime = rexp(
n_total,
rate = exp(fixed_effect * x + rnorm(n_cluster, sd = random_effect_sd)[id])
),
censor_time = runif(n_total, 0, max_time), # or a fixed max, or rexp() for random censoring
status = as.integer(eventtime <= censor_time), # must be computed before eventtime is overwritten
eventtime = pmin(eventtime, censor_time)
)
# I fit the model to my true data
m_true <- coxph(
Surv(eventtime, status) ~ x + frailty(id, distribution = "gaussian"),
data = true_data
)
# Then I extract the cumulative baseline hazard:
cum_baseline_hazard <- basehaz(m_true, centered = FALSE) # baseline at x = 0, not at the covariate means
head(cum_baseline_hazard)
# hazard time
# 1 0.000988 0.0005937
# 2 0.001977 0.0016313
# 3 0.002967 0.0038257
# 4 0.003958 0.0039334
# 5 0.004950 0.0040876
# 6 0.005944 0.0042212
# Now, I simulate data based on the parameters above, using the inverse transform sampling method.
# actual simulation
sim_data <- tibble(
id = rep(1:n_cluster, each = n_obs_per_cluster),
x = rnorm(n_total),
re = rnorm(n_cluster, mean = 0, sd = random_effect_sd)[id]
)
# calculate the linear predictor for each observation
linpred <- fixed_effect * sim_data$x + sim_data$re
# simulate event times using the inverse transform sampling method
u <- runif(n_total)
target_h <- -log(u) / exp(linpred)
# Interpolate: find t such that H(t) = target_h
h_to_t <- approxfun(
x = cum_baseline_hazard$hazard,
y = cum_baseline_hazard$time,
method = "linear",
rule = 2, # clamp to range
ties = "mean"
)
systime_simHand <- system.time(sim_data1 <- sim_data |>
mutate(
time_raw = h_to_t(target_h),
eventtime = pmin(time_raw, max_time),
# censored if the event falls after max_time or beyond the range of the estimated baseline hazard
status = as.numeric(time_raw <= max_time & target_h <= max(cum_baseline_hazard$hazard))
) |>
select(-re, -time_raw)
)
fit1 <- coxph(
Surv(eventtime, status) ~ x + frailty(id, distribution = "gaussian"),
data = sim_data1
)
# recovery of estimates and duration
cat("True Effect: ", fixed_effect, "\n")
cat("Estimated Effect: ", summary(fit1)$coefficients["x", "coef"], "\n")
cat("Simulation Time: ", systime_simHand["elapsed"], " seconds\n")
# True Effect: 0.2
# Estimated Effect: 0.2061
# Simulation Time: 0.002 seconds
# Next, I did the same thing but using the simsurv() function
# de-cumulate the baseline hazard to get the hazard function
baseline_hazard <- data.frame(
time = cum_baseline_hazard$time,
hazard = diff(c(0,cum_baseline_hazard$hazard))
)
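One caveat I am not sure about: diff() yields the jump sizes of the Breslow estimator at the event times, not a hazard rate per unit time. Because simsurv() numerically integrates the supplied hazard function, a rate-style approximation would also divide each jump by the spacing between event times. A minimal sketch of that alternative, with a toy cumulative hazard standing in for the basehaz() output:

```r
# toy cumulative hazard (stands in for cum_baseline_hazard from basehaz())
cum_h_toy <- data.frame(time = c(0.5, 1.0, 2.0), hazard = c(0.1, 0.3, 0.8))

# piecewise-constant rate: increment of H0 divided by the time spacing
step_rate <- diff(c(0, cum_h_toy$hazard)) / diff(c(0, cum_h_toy$time))
step_rate # 0.2 0.4 0.5
```

I have not checked whether this changes the results above, so I stuck with the plain diff() version below.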
# specify a hazard function for simsurv()
my_hazard <- function(t, x, betas) {
h0 <- approx(
x = baseline_hazard$time,
y = baseline_hazard$hazard,
xout = t,
method = "linear",
rule = 2
)$y
h0 * exp(betas["x"] * x$x + betas["re"] * x$re)
}
systime_simsurv <- system.time(simsurv_data <- simsurv(
hazard = my_hazard,
x = sim_data,
betas = c(x = fixed_effect, re = 1), # effect of x; the random effect enters with coefficient 1
maxt = max_time
))
sim_dat2 <- cbind(sim_data, simsurv_data[, c("eventtime", "status")])
fit2 <- coxph(
Surv(eventtime, status) ~ x + frailty(id, distribution = "gaussian"),
data = sim_dat2
)
cat("True Effect: ", fixed_effect, "\n")
cat("Estimated Effect: ", summary(fit2)$coefficients["x", "coef"], "\n")
cat("Simulation Time: ", systime_simsurv["elapsed"], " seconds\n")
# True Effect: 0.2
# Estimated Effect: 0.212
# Simulation Time: 13.76 seconds
I am very grateful for any insights!
Reference:
Bender, R., Augustin, T., & Blettner, M. (2005). Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine, 24(11), 1713–1723. https://doi.org/10.1002/sim.2059