Left Truncation Review

We discuss how to fit parametric models, Kaplan-Meier curves, and Cox PH models with left truncated data. With left truncated data, for each observation \(i\) there is a lower limit \(L_i\) such that the data point \(T_i\) would not have been included in the sample if \(T_i < L_i\). One common example of this situation arises in astronomy. On a given night, a telescope has a minimum brightness it can observe, \(L_i\). Any star less bright than \(L_i\) will not make it into the sample. The value \(L_i\) changes from night to night based on observing conditions. For example, if it is cloudy with a full moon, \(L_i\) will be large (we can only observe bright stars), while a cloudless night with no moon has little light pollution and allows us to observe stars down to a small \(L_i\). The data we obtain are \((L_i,T_i)_{i=1}^n\). Our goal is to estimate the density of star brightnesses. We will see that ignoring the \(L_i\), and creating a density estimate based only on the \(T_i\), leads to large biases.

Left Truncation Likelihood

Let \(f_X\) be the density of \(T\) without truncation. For a single observation \((L,T)\), we assume the joint density of \((L,T)\) is:

\[ f(l,t) = f(t|l)f(l) = \frac{f_X(t)\mathbb{1}_{t \geq l}}{P(T > l)}f(l) = \frac{f_X(t)\mathbb{1}_{t \geq l}}{S_X(l)}f(l) \]

Supposing that the distribution \(f_X\) depends on unknown parameters \(\beta\) (and hence \(S_X\) depends on \(\beta\)) and that the distribution of \(L\) does not depend on \(\beta\) the likelihood for all observations is

\[ L(\beta) \propto \prod_{i=1}^n \frac{f_X(t_i|\beta)\mathbb{1}_{t_i \geq l_i}}{S_X(l_i|\beta)} \]
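When \(f_X\) does not admit a closed-form MLE, this likelihood can be maximized numerically. Below is a minimal sketch (our own illustration, not from the text) using an exponential \(f_X\), where `dexp` and `pexp` supply \(\log f_X\) and \(\log S_X\); the variable names are ours.

```r
## Sketch: numerical maximization of the left-truncation likelihood,
## assuming an exponential f_X with rate beta.
set.seed(1)
beta_true <- 0.3
x <- rexp(5000, beta_true)                  # untruncated draws
l <- sample(c(2, 5), 5000, replace = TRUE)  # truncation limits
keep <- x > l                               # observation survives truncation
t_obs <- x[keep]; l_obs <- l[keep]

## negative log-likelihood: -sum( log f_X(t|beta) - log S_X(l|beta) )
negloglik <- function(beta) {
  -sum(dexp(t_obs, rate = beta, log = TRUE) -
         pexp(l_obs, rate = beta, lower.tail = FALSE, log.p = TRUE))
}
fit <- optimize(negloglik, interval = c(0.01, 5))
fit$minimum  # should be close to beta_true
```

The same pattern works for any parametric family: swap in the appropriate density and survival functions inside `negloglik`.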

Left Truncation with Simulated Astronomy Example

Simulate Distribution of Star Luminosities

We simulate a population of stars.

library(survival)
library(KMsurv)
library(ggplot2)
library(ggfortify)
set.seed(11042019)
n.pop <- 1e6
beta_true <- 0.3
x.pop <- rexp(n.pop,beta_true)
hist(x.pop,breaks=50,main="",xlim=c(0,20))

Collect Sample, Truncate, and Conduct Naive Analysis

We collect a sample.

n.samp <- 1000
## untruncated sample
t <- sample(x.pop,n.samp)
hist(t)

Truncate the sample. Assume different truncation values on good_night and bad_night. On good nights there are no clouds and no moon so the truncation value is low (the telescope can see dim objects). On bad nights there is a lot of light pollution so the truncation value is larger (can only see bright objects).

good_night <- 2
bad_night <- 5

## sample truncation values
l <- sample(c(good_night,bad_night),n.samp,replace=TRUE,prob=c(.2,.8))
dat <- cbind(l,t)
nrow(dat)
## [1] 1000
dat <- dat[t > l,]
head(dat)
##      l         t
## [1,] 5 10.404149
## [2,] 2  3.404757
## [3,] 5 10.858843
## [4,] 2  3.053563
## [5,] 5  5.543121
## [6,] 2  3.300628
nrow(dat)
## [1] 272
hist(dat[,2],main="Observed Distribution of Brightnesses",breaks=40)

Suppose we ignore the truncation and compute a density estimate from the sample. The result is very wrong, as can be seen from the above histogram, which contains two modes. These modes are entirely an artifact of the truncated sampling. In a real application, however, we would not know the estimate is wrong, because we would never observe the untruncated population distribution.

d <- density(dat[,2],bw="SJ")
hist(t,freq=FALSE,breaks=30,xlim=c(good_night,20),main="Distribution of Star Brightness")
abline(v=c(2,5),col='orange',lwd=2.5)
points(d$x,d$y,type='l',col='blue',lwd=2.5)
legend("topright",c("Ignoring Truncation","Truncation Points"),col=c("blue","orange"),lwd=2.5)

Parametric Modeling

If we assume the correct functional form (exponential), then it is possible to analytically derive the MLE. The survival function is \(S_X(t|\beta) = \int_t^\infty \beta e^{-\beta s}\,ds = e^{-\beta t}\). So the likelihood becomes:

\[ L(\beta) = \prod_{i=1}^n \frac{f(t_i|\beta)}{S(l_i|\beta)} = \prod_{i=1}^n \frac{\beta e^{-\beta t_i}}{e^{-\beta l_i}} = \beta^n e^{\beta \sum_{i=1}^n (l_i - t_i)} \] Taking logs gives \(\ell(\beta) = n\log\beta + \beta\sum_{i=1}^n(l_i - t_i)\). Differentiating, setting equal to 0, and solving we obtain

\[ \widehat{\beta} = \frac{n}{\sum_{i=1}^n (t_i - l_i)} \] This is simple to compute by hand:

## add censoring column (always 1 b/c never censored)
dat <- cbind(dat,1)
dat <- as.data.frame(dat)
colnames(dat)[3] <- "delta"

beta_hat <- nrow(dat)/sum(dat$t - dat$l)
beta_hat
## [1] 0.2978863
beta_true
## [1] 0.3

We see that the estimate is reasonably accurate. Standard errors could be obtained from the observed Fisher information (or the delta method for functions of \(\widehat{\beta}\)).
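As a sketch of the uncertainty calculation (our own illustration, not from the text): from \(\ell(\beta) = n\log\beta + \beta\sum(l_i - t_i)\) the second derivative is \(-n/\beta^2\), so the observed-information standard error is \(\widehat{\beta}/\sqrt{n}\). The code below re-simulates the data so it is self-contained; all names are ours.

```r
## Sketch: observed-Fisher-information standard error for the truncated
## exponential MLE.  l''(beta) = -n/beta^2  =>  se = beta_hat/sqrt(n).
set.seed(11042019)
x <- rexp(1e6, 0.3)                                     # population
t <- sample(x, 1000)                                    # untruncated sample
l <- sample(c(2, 5), 1000, replace = TRUE, prob = c(.2, .8))
keep <- t > l
t_k <- t[keep]; l_k <- l[keep]
n <- sum(keep)

beta_hat <- n / sum(t_k - l_k)                          # closed-form MLE
se <- beta_hat / sqrt(n)                                # observed information
c(estimate = beta_hat,
  lower = beta_hat - 1.96 * se,
  upper = beta_hat + 1.96 * se)                         # approximate 95% CI
```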

Compute Kaplan-Meier for Truncated Data

A disadvantage of the above approach is that we assumed the exponential form of the model, which often would not be known in practice. We now compute the Kaplan-Meier estimate for the truncated data. Note that the Kaplan-Meier estimator estimates the survival function conditional on the smallest truncation value, in this case good_night. So we estimate \(P(X > t | X \geq 2)\). Below 2, we have no information about the form of the survival function.

fit <- survfit(Surv(l,t,delta)~1,data=dat)
plot(fit)
abline(v=good_night)

Note that the estimated survival function equals \(1\) at brightness 2. This is because the code estimates the survival function conditional on star brightness being at least 2, not because we believe that in the population all stars are brighter than this.
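Because the exponential is memoryless, the target of the Kaplan-Meier estimate here is known exactly: \(P(X > t | X \geq 2) = e^{-\beta(t-2)}\). The following sketch (ours, with hypothetical variable names) checks the KM curve against this truth on freshly simulated data.

```r
## Sketch: compare the left-truncated Kaplan-Meier curve to the true
## conditional survival exp(-beta*(t - 2)) implied by memorylessness.
library(survival)
set.seed(1)
x <- rexp(2e5, 0.3)
l <- sample(c(2, 5), 2e5, replace = TRUE, prob = c(.2, .8))
keep <- x > l
d <- data.frame(l = l[keep], t = x[keep], delta = 1)
d <- d[sample(nrow(d), 1000), ]            # thin to a sample of 1000

km <- survfit(Surv(l, t, delta) ~ 1, data = d)
true_surv <- exp(-0.3 * (km$time - 2))     # conditional on X >= 2
mean(abs(km$surv - true_surv))             # small if KM tracks the truth
```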

Kaplan-Meier to Density

We turn the KM estimate into a density estimate. While survival functions are popular in health and biological fields, for many applications they are harder to interpret than density functions. Recall that the density is the negative derivative of the survival function:

\[ f(t) = -S'(t) \]
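For the exponential case this is easy to verify directly:

\[ -\frac{d}{dt} S_X(t) = -\frac{d}{dt} e^{-\beta t} = \beta e^{-\beta t} = f_X(t) \]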

We use the function locpoly to compute the derivative of the survival function.

library(KernSmooth)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## locpoly with drv=1 estimates S'(t), i.e. the negative of the density
out <- locpoly(fit$time,fit$surv,drv=1,bandwidth=1)
str(out)
## List of 2
##  $ x: num [1:401] 2 2.06 2.12 2.18 2.23 ...
##  $ y: num [1:401] -0.276 -0.275 -0.274 -0.272 -0.27 ...
out$y <- -out$y
out$y[out$y<0] <- 0

We plot the Kaplan-Meier based density estimate and the naive density estimate (ignoring truncation).

hist(t[t>good_night],freq=FALSE,breaks=30,xlim=c(2,20),main="Conditional Distribution of Magnitudes")
abline(v=c(good_night,bad_night),col='orange',lwd=2.5)
points(out$x,out$y,type='l',lwd=2.5,col='red')
points(d$x,d$y,type='l',col='blue',lwd=2.5)
legend("topright",c("KM Based Density","Ignoring Truncation","Truncation Points"),col=c("red","blue","orange"),lwd=2.5)

Ignoring the left truncation leads to poor performance. Note that the above histogram is for the density conditioned on \(x > 2\), the smallest truncation value.

Reproduce Example 9.4 in Textbook

We demonstrate the Cox PH model with left truncated and right censored data by reproducing Example 9.4 in the textbook.

data(channing)
channing <- channing[channing$ageentry<channing$age,]
## make some plots
fit <- coxph(Surv(ageentry,age,death)~gender,data=channing,method='breslow')
fit
## Call:
## coxph(formula = Surv(ageentry, age, death) ~ gender, data = channing, 
##     method = "breslow")
## 
##           coef exp(coef) se(coef)      z      p
## gender -0.3158    0.7292   0.1731 -1.824 0.0682
## 
## Likelihood ratio test=3.16  on 1 df, p=0.07531
## n= 458, number of events= 176
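To report the fit on the hazard-ratio scale, `summary()` on the fitted model gives \(e^{\widehat{\beta}}\) with a 95% confidence interval. A self-contained sketch (repeating the fit above so it runs on its own):

```r
## Sketch: extract the gender hazard ratio and 95% CI from the Cox fit.
library(survival)
library(KMsurv)
data(channing)
channing <- channing[channing$ageentry < channing$age, ]
fit <- coxph(Surv(ageentry, age, death) ~ gender, data = channing,
             method = "breslow")
summary(fit)$conf.int   # exp(coef) with lower/upper 95% limits
```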