```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = 'allow')
library(KMsurv)
library(survival)
library(ggplot2)
```
## Purpose of Model Building
Section 8.7 discusses two possible approaches to model building based on what the model is intended to do. The model may be used to:
1. Assess causal relationships by accounting for confounding and/or interactions. Usually there is a variable of primary interest which we suspect is causing some response.
2. Predict time-to-event. We would like to obtain accurate estimates of time to event given some covariates. We do not necessarily care if any of these covariates cause the event.
The book mostly uses p-values and the Akaike Information Criterion (AIC) in the context of forward stepwise regression for both of these tasks. While these are useful tools, the book does not discuss:
a. graphical exploratory data analysis
b. formal causal inference frameworks (e.g. Pearl's counterfactual framework based on DAGs or Rubin's potential outcome framework)
c. cross-validation, training/test sets, and scoring functions for measuring predictive performance (sometimes called statistical machine learning)
I believe b. is particularly underrated in the statistics community. For example, the book discusses "adjusting for potential confounders," but a satisfactory definition of confounding requires a formal causal inference framework.
We discuss a. and a little bit of b. here as well as the p-value and AIC used in the textbook. After Spring Break, we will discuss c. in the context of high dimensional survival prediction.
## Examples of Confounders, Interactions, Mediators: Simulations
Here we assume the primary interest is the $Z_1$, $Y$ relationship and $Z_2$ is another variable we may add to the model.
#### Confounding
We suppose $Z_2 \rightarrow Z_1$ and $Z_2 \rightarrow Y$. There is no arrow from $Z_1$ to $Y$, so $Z_1$ has no causal effect on $Y$. Here $Z_2$ confounds the $Z_1$, $Y$ relationship, and it is important to include it in the model.
```{r confounder}
n <- 1000
z2 <- rnorm(n)
z1 <- 2*z2 + 1 + rnorm(n)
x <- rexp(n,rate=exp(z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, not controlling for z2
coxph(Surv(y,del)~z1)
## after controlling for z2, not significant
## true coefficient on z1 is 0
coxph(Surv(y,del)~z1+z2)
```
#### Mediator
We suppose $Z_1 \rightarrow Z_2 \rightarrow Y$. Here $Z_2$ is a **mediator**. Controlling for $Z_2$ leads to the wrong conclusion that $Z_1$ has no effect on $Y$.
```{r mediator}
n <- 1000
z1 <- rnorm(n)
z2 <- 2*z1 + 1 + rnorm(n)
x <- rexp(n,rate=exp(z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1)
## after controlling for z2, not significant
## incorrect analysis
coxph(Surv(y,del)~z1+z2)
```
#### Interaction
We suppose $Z_1 \rightarrow Y$ and $Z_2 \rightarrow Y$ but $Z_1$ and $Z_2$ are independent. $Z_1$ and $Z_2$ can **interact** in the model which means the effect of $Z_1$ on $Y$ depends on the value of $Z_2$.
**No Interaction Case:**
```{r no_interaction}
n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1+z2)
## leaving out z2 is suboptimal, but not too bad
coxph(Surv(y,del)~z1)
```
**Quantitative Interaction - Magnitude of z1 effect depends on z2:**
```{r interaction_coherent}
n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2 + z1*z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1+z2+z1*z2)
## leaving out interaction is suboptimal, but not too bad
coxph(Surv(y,del)~z1+z2)
```
**Qualitative Interaction - Direction of z1 effect depends on z2:**
```{r interaction_switch}
n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2 - 3*z1*z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1+z2+z1*z2)
## leaving out the interaction we get only an average of effects,
## which here is negative because the -3*z1*z2 term
## outweighs the +z1 term
coxph(Surv(y,del)~z1+z2)
```
#### Summary
* It is possible to over-control; e.g., controlling for a mediator is usually a bad idea.
* Randomized experiments disable any arrows into $Z_1$, thus eliminating confounders.
* The concepts **confounder** and **adjusting for a covariate** are usually linked with causality even if not explicitly stated this way.
* The above examples provide some causal context for Section 8.7 of the textbook. See Pearl 2009 "Causality" for the whole story.
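To illustrate the bullet on randomization, here is a small simulation sketch in the style of the confounding example above (same hypothetical variables): when $Z_1$ is randomized rather than generated from $Z_2$, the arrow $Z_2 \rightarrow Z_1$ disappears and the unadjusted model no longer finds a spurious $Z_1$ effect.

```{r randomized_z1}
library(survival)
set.seed(1)
n <- 1000
z2 <- rnorm(n)
z1 <- rnorm(n)               ## randomized: z1 no longer depends on z2
x <- rexp(n, rate=exp(z2))   ## z1 still has no causal effect on Y
y <- pmin(1, x)              ## observations censored at time 1
del <- 1*(x < 1)             ## event indicator
## unlike the confounded simulation, z1 is not significant
## even without controlling for z2
coxph(Surv(y, del) ~ z1)
```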
## Book Approach to Model Building
The book advocates building the model by sequentially adding covariates one at a time, choosing either the covariate with the smallest p-value from a local test or the covariate producing the lowest AIC. The procedure terminates when no remaining covariate meets the p-value threshold (usually 0.05) or when every candidate would increase the AIC. This approach is sometimes known as forward model selection. In the case where one predictor is of primary interest, say $z_1$, the book recommends including this predictor first in the model.
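As a sketch, base R's `step()` can automate the AIC version of this forward search for `coxph` models, holding the primary covariate in the model via the `lower` scope. The candidate set below (`z1`, `z2`, `z7`, `z8`, `z10` from the `bmt` data) is illustrative, not the book's exact Table 8.6 set.

```{r step_forward_sketch}
library(survival)
library(KMsurv)
data(bmt)
bmt$disease <- as.factor(c("ALL","AML Low","AML High")[bmt$group])
## start from the model containing the primary covariate of interest
base <- coxph(Surv(t2,d3) ~ disease, data=bmt)
## forward search: disease is always kept, candidates are added
## only if they lower the AIC
step(base,
     scope=list(lower=~disease, upper=~disease+z1+z2+z7+z8+z10),
     direction="forward")
```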
Opinion: When the goal is to understand the causal impact of some variable on $Y$, I find this approach weak because it does not operate in a formal causal framework, so the final interpretation of the model coefficients is unclear. When the goal is building a predictive model, the approach is more justifiable: the model predictions, in the form of a survival curve for each set of covariates, are of primary interest (rather than the model coefficients). There are many more modern methods, such as penalized regression, that will probably work better than the forward selection approach discussed here. We will study some of those methods later in the course.
## Assessing Causal Relationships: Bone Marrow Transplant Data
#### Goals and Exploratory Data Analysis
We would like to understand the association between disease type and survival. Does disease type itself affect survival, or are there confounders which cause both?
```{r exploratory_data_analysis, fig.height=4,fig.width=5}
data(bmt)
?bmt
table(bmt$group)
disease <- c("ALL","AML Low","AML High")
bmt$disease <- as.factor(disease[bmt$group])
table(bmt$disease)
fit <- survfit(Surv(t2,d3)~disease,data=bmt)
par(mar=c(5,5,1,1))
plot(fit,xlab="Time",lty=1:3,col=1:3)
legend("topright",names(fit$strata),lty=1:3,col=1:3)
```
We should discuss potential confounders with doctors and biologists. Graphs can help generate theories. For example, what about patient age and wait time to transplant as potential confounders?
```{r graphical_potential_confounders,fig.height=3,fig.width=5}
ggplot(bmt, aes(x=z1, fill=disease)) +
  geom_density(alpha=0.3) + xlab("Age of Patient")
ggplot(bmt, aes(x=z7, fill=disease)) +
  geom_density(alpha=0.3) + xlab("Wait Time to Transplant") +
  xlim(0,500)
```
Note the relationship between the survival curves and the Wait Time to Transplant densities. Perhaps
$$
\text{Disease } \rightarrow \text{ Transplant Wait Time } \rightarrow \text{Survival}
$$
* Disease $\rightarrow$ Transplant relationship may be the result of algorithms which decide who gets a transplant first. These algorithms may favor AML Low over ALL over AML High.
* Transplant Wait Time $\rightarrow$ Survival relationship may be a result of cancer progression while waiting for transplant: the longer you wait, the more the cancer grows, decreasing survival probability.
Supposing this is the case, Disease does have a causal effect on Survival, but through wait times rather than intrinsic biology. Future research should then focus on reducing wait times, not on understanding the biology of the diseases.
The above paragraphs are speculation, but this sort of thinking is useful for generating possible causal relations and the appropriate resulting policy actions. The data can be used to assess the degree of support for these ideas.
#### Forward Model Selection (Example 8.5)
We now illustrate forward selection with p-values and AIC. Fit the base model with the primary covariate of interest, `disease`.
```{r formal_bmt}
fit <- coxph(Surv(t2,d3)~disease,data=bmt,method="breslow")
fit
fit$wald.test
pchisq(fit$wald.test,df=2,lower.tail=FALSE)
aic_base <- -2*fit$loglik[2] + 2*length(fit$coef)
aic_base
```
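As a quick sanity check (not in the textbook), the manual AIC above should agree with the generic `AIC()` function, which for `coxph` objects is computed from the partial log-likelihood.

```{r aic_check}
library(survival)
library(KMsurv)
data(bmt)
bmt$disease <- as.factor(c("ALL","AML Low","AML High")[bmt$group])
fit <- coxph(Surv(t2,d3) ~ disease, data=bmt)
## both should give the same number
c(manual = -2*fit$loglik[2] + 2*length(fit$coef), generic = AIC(fit))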
Now consider adding each variable to the model. First we consider waiting time to transplant:
```{r formal_bmt_add_one}
fit <- coxph(Surv(t2,d3)~disease+z7,data=bmt,method="breslow")
fit
fit$wald.test
## compute p-value for local Wald test of the added covariate
w <- fit$coef[3]^2/fit$var[3,3]
pchisq(w,df=1,lower.tail=FALSE)
## compute AIC
-2*fit$loglik[2] + 2*length(fit$coef)
aic_base
```
If we repeat this analysis for all covariates we get the following table:
![Table 8.6](https://longjp.github.io/survival/lectures/km_table8_6.png)
The textbook advocates adding the covariate with the lowest p-value or lowest AIC to the model, in this case FAB. This process is repeated on the remaining covariates, with disease type and FAB in the model, until no covariate passes the p-value threshold of 0.05 or decreases the AIC.
```{r formal_bmt_table, echo=FALSE,eval=FALSE}
## build the p-value and AIC columns of Table 8.6
## rename variables
bmt$waiting <- bmt$z7
bmt$fab <- bmt$z8
bmt$mtx <- bmt$z10
bmt$sex <- interaction(bmt$z3,bmt$z4)
bmt$cmv <- interaction(bmt$z5,bmt$z6)
to_fit <- c("waiting","fab","mtx","sex","cmv")
tab <- matrix(0,nrow=length(to_fit),ncol=2,
              dimnames=list(to_fit,c("p-value","AIC")))
for(ii in seq_along(to_fit)){
  form <- as.formula(paste0("Surv(t2,d3)~disease+",to_fit[ii]))
  fit <- coxph(form,data=bmt,method="breslow")
  ## local Wald test for the added covariate(s)
  idx <- 3:length(fit$coef)
  w <- t(fit$coef[idx]) %*% solve(fit$var[idx,idx]) %*% fit$coef[idx]
  tab[ii,1] <- pchisq(w,df=length(idx),lower.tail=FALSE)
  tab[ii,2] <- -2*fit$loglik[2] + 2*length(fit$coef)
}
tab
```
## Predictive Modeling
In Example 8.6 the book also discusses forward selection when the primary goal is to **find a model predictive of the distribution of time to weaning.** Later it states: **We are mainly interested in finding factors which contribute to the distribution of time to weaning.**
The p-value and AIC criteria are proxies for quality improvement of the predictive model when adding a covariate. We discuss other measures of predictive performance of a model after Spring break. These methods will enable comparison of more general models.
## Generating Survival Curves
Once we have estimated the parametric component of the Cox PH model with $\widehat{\beta}$, we can estimate the nonparametric baseline cumulative hazard component as
$$
\widehat{\Lambda}_0(t) = \sum_{t_i \leq t} \frac{d_i}{\sum_{j \in R(t_i)} \exp(\widehat{\beta}^T Z_j)}
$$
The resulting baseline survival function is $\widehat{S}_0(t) = \exp(-\widehat{\Lambda}_0(t))$ and the survival function for an individual with covariates $Z$ is
$$
\widehat{S}(t|Z) = \widehat{S}_0(t)^{\exp(\widehat{\beta}^T Z)}
$$
The observation specific survival functions can be found by running `survfit` on a `coxph` object in `R`.
```{r coxph_predicted_curve}
fit <- coxph(Surv(t2,d3)~disease,data=bmt,method="breslow")
fit
## predict survival curve for observations 1 and 100
bmt_sub <- bmt[c(1,100),]
out <- survfit(fit,newdata=bmt_sub)
str(out)
plot(out,col=1:2)
```
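The chunk above relies on `survfit`, but we can also compute $\widehat{S}(t|Z)$ directly from the formulas: `basehaz(fit, centered=FALSE)` returns the estimated baseline cumulative hazard $\widehat{\Lambda}_0(t)$ at $Z = 0$, and raising $\widehat{S}_0(t)$ to the power $\exp(\widehat{\beta}^T Z)$ gives an individual curve. A sketch; the coefficient name below is read off the fitted model.

```{r manual_survival_curve}
library(survival)
library(KMsurv)
data(bmt)
bmt$disease <- as.factor(c("ALL","AML Low","AML High")[bmt$group])
fit <- coxph(Surv(t2,d3) ~ disease, data=bmt, method="breslow")
## Breslow baseline cumulative hazard at Z = 0 (an ALL patient)
H0 <- basehaz(fit, centered=FALSE)
S0 <- exp(-H0$hazard)
## survival curve for an AML High patient: S0(t)^exp(beta^T Z)
lp <- unname(coef(fit)["diseaseAML High"])
S_high <- S0^exp(lp)
head(cbind(time=H0$time, S0=S0, S_high=S_high))
```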
## Optional Exercises / Work
* Read "Statistical Modeling: The Two Cultures" by Leo Breiman.
* Reproduce Table 8.12 in textbook.
```{r makedotRfile, echo=FALSE, include=FALSE}
## automatically generates .R file when running knitr
knitr::purl("03coxph.Rmd")
```