Left Truncation Review

We discuss how to fit parametric models, Kaplan Meier curves, and Cox PH with left truncated data. With left truncated data, for each observation \(i\) there is a lower limit \(L_i\) such that the data point \(T_i\) would not have been included in the sample if \(T_i < L_i\). One common example of this situation is in astronomy. On a given night, a telescope will have a minimum brightness it can observe \(L_i\). Any star less bright than \(L_i\) will not make it into the sample. The value \(L_i\) changes from night to night based on observing conditions. For example if it is cloudy with a full moon \(L_i\) will be large (can only observe bright stars). However a cloudless night with no moon will have little light pollution and allow us to observe starts down to a small \(L_i\) value. The data we obtain is \((L_i,T_i)_{i=1}^n\). Our goal is to estimate the density of star brightnesses. We will see that ignoring the \(L_i\), and creating a density estimate based only on \(T_i\), leads to large biases.

Left Truncation Likelihood

Let \(f_X\) be the distribution of \(T\) without trunction. For a single observation \((L,T)\), we assume the joint density of \((L,T)\) is:

\[ f(l,t) = f(t|l)f(l) = \frac{f_X(t)\mathbb{1}_{t \geq l}}{P(T > l)}f(l) = \frac{f_X(t)\mathbb{1}_{t \geq l}}{S_X(l)}f(l) \]

Supposing that the distribution \(f_X\) depends on unknown parameters \(\beta\) (and hence \(S_X\) depends on \(\beta\)) and that the distribution of \(L\) does not depend on \(\beta\) the likelihood for all observations is

\[ L(\beta) \propto \prod_{i=1}^n \frac{f_X(t_i|\beta)\mathbb{1}_{t_i \geq l_i}}{S_X(l_i|\beta)} \]

Left Truncation with Simulated Astronomy Example

Simulate Distribution of Star Luminosities

We simulate a population of stars.

library(survival)
library(KMsurv)
library(ggplot2)
library(ggfortify)
set.seed(11042019)
n.pop <- 1e6
beta_true <- 0.3
x.pop <- rexp(n.pop,beta_true)
hist(x.pop,breaks=50,main="",xlim=c(0,20))

Collect Sample, Truncate, and Conduct Naive Analysis

We collect a sample.

n.samp <- 1000
## untruncated sample
t <- sample(x.pop,n.samp)
hist(t)

Truncate the sample. Assume different truncation values on good_night and bad_night. On good nights there are no clouds and no moon so the truncation value is low (the telescope can see dim objects). On bad nights there is a lot of light pollution so the truncation value is larger (can only see bright objects).

good_night <- 2
bad_night <- 5

## sample trunction times
l <- sample(c(good_night,bad_night),1000,replace=TRUE,prob=c(.2,.8))
dat <- cbind(l,t)
nrow(dat)
## [1] 1000
dat <- dat[t > l,]
head(dat)
##      l         t
## [1,] 5 10.404149
## [2,] 2  3.404757
## [3,] 5 10.858843
## [4,] 2  3.053563
## [5,] 5  5.543121
## [6,] 2  3.300628
nrow(dat)
## [1] 272
hist(dat[,2],main="Observed Distribution of Brightnesses",breaks=40)

Suppose we ignore the truncation and compute a density estimate using the sample. The result is very wrong. This can be see from above histogram which contains two modes. However the modes are completely an artifact of the sampling. However in a real life application you would not know the result is very wrong because we would not have seen the population distribution earlier.

d <- density(dat[,2],bw="SJ")
hist(t,freq=FALSE,breaks=30,xlim=c(good_night,20),main="Distribution of Star Brightness")
abline(v=c(2,5),col='orange',lwd=2.5)
points(d$x,d$y,type='l',col='blue',lwd=2.5)
legend("topright",c("Ignoring Truncation","Truncation Points"),col=c("blue","orange"),lwd=2.5)