I came across the article by Bender et al. (2005) and tried to implement it in R to simulate survival times based on an empirical baseline hazard estimated from existing data.
I compute survival times based on equation (6), which transforms "uniformly distributed random numbers [...] into survival times following a specific Cox model [where it] is just required to insert the inverse of an appropriate cumulative baseline hazard function into equation (6)" (Bender et al., 2005, p. 1716).
Said equation is: $T = H_0^{-1}\left[-\log(U)\exp(-\beta' x)\right]$
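For reference, this is just inverse-transform sampling applied to the Cox survival function $S(t \mid x) = \exp\{-H_0(t)\exp(\beta' x)\}$: setting $U = S(T \mid x)$ with $U \sim \mathrm{Uniform}(0,1)$ and solving for $T$ gives

$$
U = \exp\bigl(-H_0(T)\exp(\beta' x)\bigr)
\;\Longrightarrow\;
H_0(T) = -\log(U)\exp(-\beta' x)
\;\Longrightarrow\;
T = H_0^{-1}\bigl[-\log(U)\exp(-\beta' x)\bigr].
$$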
Considering a random effect (re) and one predictor (x), I coded:
# calculate the linear predictor for each observation
linpred <- beta * x + re
# simulate event times using the inverse transform sampling method
u <- runif(n_total)
target_h <- -log(u) / exp(linpred)
# Interpolate: find t such that H(t) = target_h
h_to_t <- approxfun(
x = cum_baseline_hazard$hazard,
y = cum_baseline_hazard$time,
method = "linear",
rule = 2, # clamp to range
ties = "mean"
)
# generate survival times
T <- h_to_t(target_h)
where the cum_baseline_hazard object is computed from real data.
This is considerably faster than using the simsurv() function from the package of the same name. For a scenario of 100 clusters, 10 observations per cluster, a fixed effect of 0.2, and a random-effect standard deviation of 0.5, my way of computing the survival times took 0.002 seconds and recovered the true effect quite well (estimated effect = 0.2061), whereas simsurv() took 13.76 seconds (also with a well-recovered estimate of 0.212).
Is my translation of the equation correct?
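As a quick, self-contained sanity check of the inversion logic, I can use a made-up exponential baseline $H_0(t) = \lambda t$, whose inverse is known in closed form (so this does not touch the empirical baseline; lambda, the grid, and the linear predictor are arbitrary choices for illustration):

```r
# exponential baseline H0(t) = lambda * t on a fine grid
lambda <- 0.5
grid <- seq(0, 20, by = 0.01)
cum_h_exp <- data.frame(time = grid, hazard = lambda * grid)

# same interpolation-based inverse as in my code below
h_to_t_check <- approxfun(
  x = cum_h_exp$hazard,
  y = cum_h_exp$time,
  method = "linear",
  rule = 2
)

set.seed(1)
u <- runif(1000)
linpred <- 0.3 # arbitrary fixed linear predictor
target_h <- -log(u) / exp(linpred)

t_interp <- h_to_t_check(target_h)
t_exact <- target_h / lambda # closed-form inverse of H0

max(abs(t_interp - t_exact)) # should be ~0 (linear interpolation is exact here)
```

Because $H_0$ is linear, linear interpolation recovers the analytic inverse up to floating-point error, which at least confirms the mechanics of the transformation.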
For a replication of those results and a full example (using simulated stand-in data rather than real data), see my code below.
library(tidyverse)
library(survival)
library(simsurv)
set.seed(42)
n_cluster <- 100
n_obs_per_cluster <- 10
n_total <- n_cluster * n_obs_per_cluster
fixed_effect <- 0.2
random_effect_sd <- 0.5
max_time <- 10
# assume this would be real data
true_data <- tibble(
id = rep(1:n_cluster, each = n_obs_per_cluster),
x = rnorm(n_total)
) |>
mutate(
eventtime = rexp(
n_total,
rate = exp(fixed_effect * x + rnorm(n_cluster, sd = random_effect_sd)[id])
),
censor_time = runif(n_total, 0, max_time), # or a fixed max, or rexp() for random censoring
status = as.integer(eventtime <= censor_time), # must be computed before eventtime is overwritten
eventtime = pmin(eventtime, censor_time)
)
# I fit the model to my true data
m_true <- coxph(
Surv(eventtime, status) ~ x + frailty(id, distribution = "gaussian"),
data = true_data
)
# Then I extract the cumulative baseline hazard:
cum_baseline_hazard <- basehaz(m_true, centered = FALSE) # baseline at x = 0, not at the covariate means
head(cum_baseline_hazard)
# hazard time
# 1 0.000988 0.0005937
# 2 0.001977 0.0016313
# 3 0.002967 0.0038257
# 4 0.003958 0.0039334
# 5 0.004950 0.0040876
# 6 0.005944 0.0042212
# Now, I simulate data based on the parameters above, using the inverse transform sampling method.
# actual simulation
sim_data <- tibble(
id = rep(1:n_cluster, each = n_obs_per_cluster),
x = rnorm(n_total),
re = rnorm(n_cluster, mean = 0, sd = random_effect_sd)[id]
)
# calculate the linear predictor for each observation
linpred <- fixed_effect * sim_data$x + sim_data$re
# simulate event times using the inverse transform sampling method
u <- runif(n_total)
target_h <- -log(u) / exp(linpred)
# Interpolate: find t such that H(t) = target_h
h_to_t <- approxfun(
x = cum_baseline_hazard$hazard,
y = cum_baseline_hazard$time,
method = "linear",
rule = 2, # clamp to range
ties = "mean"
)
systime_simHand <- system.time(sim_data1 <- sim_data |>
mutate(
time_raw = h_to_t(target_h),
eventtime = pmin(time_raw, max_time),
# censored if the event falls after max_time or beyond the range of the estimated baseline hazard
status = as.numeric(time_raw <= max_time & target_h <= max(cum_baseline_hazard$hazard))
) |>
select(-re, -time_raw)
)
fit1 <- coxph(
Surv(eventtime, status) ~ x + frailty(id, distribution = "gaussian"),
data = sim_data1
)
# recovery of estimates and duration
cat("True Effect: ", fixed_effect, "\n")
cat("Estimated Effect: ", summary(fit1)$coefficients["x", "coef"], "\n")
cat("Simulation Time: ", systime_simHand["elapsed"], " seconds\n")
# True Effect: 0.2
# Estimated Effect: 0.2061
# Simulation Time: 0.002 seconds
# Next, I did the same thing but using the simsurv() function
# de-cumulate the baseline hazard to get the hazard function
baseline_hazard <- data.frame(
time = cum_baseline_hazard$time,
hazard = diff(c(0,cum_baseline_hazard$hazard))
)
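One caveat I am not sure about: diff() yields the jump sizes of the Breslow estimator at the event times, not a hazard rate per unit time. Because simsurv() numerically integrates the supplied hazard function, a rate-style approximation would also divide each jump by the spacing between event times. A minimal sketch of that alternative, with a toy cumulative hazard standing in for the basehaz() output:

```r
# toy cumulative hazard (stands in for cum_baseline_hazard from basehaz())
cum_h_toy <- data.frame(time = c(0.5, 1.0, 2.0), hazard = c(0.1, 0.3, 0.8))

# piecewise-constant rate: increment of H0 divided by the time spacing
step_rate <- diff(c(0, cum_h_toy$hazard)) / diff(c(0, cum_h_toy$time))
step_rate # 0.2 0.4 0.5
```

I have not checked whether this changes the results above, so I stuck with the plain diff() version below.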
# specify a hazard function for simsurv()
my_hazard <- function(t, x, betas) {
h0 <- approx(
x = baseline_hazard$time,
y = baseline_hazard$hazard,
xout = t,
method = "linear",
rule = 2
)$y
h0 * exp(betas["x"] * x$x + betas["re"] * x$re)
}
systime_simsurv <- system.time(simsurv_data <- simsurv(
hazard = my_hazard,
x = sim_data,
betas = c(x = fixed_effect, re = 1), # effect of x; the random effect enters with coefficient 1
maxt = max_time
))
sim_dat2 <- cbind(sim_data, simsurv_data[, c("eventtime", "status")])
fit2 <- coxph(
Surv(eventtime, status) ~ x + frailty(id, distribution = "gaussian"),
data = sim_dat2
)
cat("True Effect: ", fixed_effect, "\n")
cat("Estimated Effect: ", summary(fit2)$coefficients["x", "coef"], "\n")
cat("Simulation Time: ", systime_simsurv["elapsed"], " seconds\n")
# True Effect: 0.2
# Estimated Effect: 0.212
# Simulation Time: 13.76 seconds
I am very grateful for any insights!
Reference:
Bender, R., Augustin, T., & Blettner, M. (2005). Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine, 24(11), 1713–1723. https://doi.org/10.1002/sim.2059