Purpose of Model Building

Section 8.7 discusses two possible approaches to model building based on what the model is intended to do. The model may be used to:

  1. Assess causal relationships by accounting for confounding and/or interactions. Usually there is a variable of primary interest which we suspect is causing some response.
  2. Predict time-to-event. We would like to obtain accurate estimates of time to event given some covariates. We do not necessarily care if any of these covariates cause the event.

The book mostly uses p-values and the Akaike Information Criterion (AIC) in the context of forward stepwise regression for both of these tasks. While these are useful tools, the book does not discuss:

  1. graphical exploratory data analysis
  2. formal causal inference frameworks (e.g. Pearl’s counterfactual framework based on DAGs or Rubin’s potential outcome framework)
  3. cross-validation, training-test splits, and scoring functions for measuring predictive performance (sometimes called statistical machine learning)

I believe item 2 is particularly underrated in the statistics community. For example, the book discusses “adjusting for potential confounders,” but a satisfactory definition of confounding requires a formal causal inference framework.

We discuss item 1, and a little of item 2, here, as well as the p-value and AIC criteria used in the textbook. After Spring Break, we will discuss item 3 in the context of high-dimensional survival prediction.

Examples of Confounders, Interactions, Mediators: Simulations

Here we assume the primary interest is the \(Z_1\), \(Y\) relationship and \(Z_2\) is another variable we may add to the model.

Confounding

We suppose \(Z_2 \rightarrow Z_1\) and \(Z_2 \rightarrow Y\). There is no arrow from \(Z_1\) to \(Y\), so \(Z_1\) has no causal effect on \(Y\). Here \(Z_2\) confounds the \(Z_1\), \(Y\) relationship, and it is important to include it in the model.

n <- 1000
z2 <- rnorm(n)
z1 <- 2*z2 + 1 + rnorm(n)
x <- rexp(n,rate=exp(z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)

## very significant pvalue, not controlling for z2
coxph(Surv(y,del)~z1)
## Call:
## coxph(formula = Surv(y, del) ~ z1)
## 
##       coef exp(coef) se(coef)     z      p
## z1 0.40704   1.50237  0.02064 19.72 <2e-16
## 
## Likelihood ratio test=406.5  on 1 df, p=< 2.2e-16
## n= 1000, number of events= 617
## after controlling for z2, not significant
## true coefficient on z1 is 0
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
## 
##       coef exp(coef) se(coef)     z      p
## z1 0.04227   1.04317  0.04243 0.996  0.319
## z2 0.96705   2.63018  0.09853 9.815 <2e-16
## 
## Likelihood ratio test=505.2  on 2 df, p=< 2.2e-16
## n= 1000, number of events= 617

Mediator

We suppose \(Z_1 \rightarrow Z_2 \rightarrow Y\). Here \(Z_2\) is a mediator; controlling for it blocks the causal path from \(Z_1\) to \(Y\) and leads to the wrong conclusion.

n <- 1000
z1 <- rnorm(n)
z2 <- 2*z1 + 1 + rnorm(n)
x <- rexp(n,rate=exp(z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)

## very significant pvalue, correct analysis
coxph(Surv(y,del)~z1)
## Call:
## coxph(formula = Surv(y, del) ~ z1)
## 
##       coef exp(coef) se(coef)     z      p
## z1 1.34662   3.84440  0.04896 27.51 <2e-16
## 
## Likelihood ratio test=798.9  on 1 df, p=< 2.2e-16
## n= 1000, number of events= 729
## after controlling for z2, not significant
## incorrect analysis
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
## 
##       coef exp(coef) se(coef)      z      p
## z1 0.11367   1.12038  0.07569  1.502  0.133
## z2 0.97626   2.65450  0.04350 22.440 <2e-16
## 
## Likelihood ratio test=1340  on 2 df, p=< 2.2e-16
## n= 1000, number of events= 729

Interaction

We suppose \(Z_1 \rightarrow Y\) and \(Z_2 \rightarrow Y\), with \(Z_1\) and \(Z_2\) independent. \(Z_1\) and \(Z_2\) can interact in the model, meaning the effect of \(Z_1\) on \(Y\) depends on the value of \(Z_2\).

**No Interaction Case:**

n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)

## very significant pvalue, correct analysis
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
## 
##       coef exp(coef) se(coef)     z      p
## z1 1.02488   2.78675  0.07204 14.23 <2e-16
## z2 0.99199   2.69661  0.07170 13.84 <2e-16
## 
## Likelihood ratio test=356.5  on 2 df, p=< 2.2e-16
## n= 1000, number of events= 878
## leaving out z2 is suboptimal, but not too bad
coxph(Surv(y,del)~z1)
## Call:
## coxph(formula = Surv(y, del) ~ z1)
## 
##       coef exp(coef) se(coef)     z      p
## z1 0.89620   2.45027  0.07029 12.75 <2e-16
## 
## Likelihood ratio test=163.9  on 1 df, p=< 2.2e-16
## n= 1000, number of events= 878

**Quantitative Interaction - Magnitude of z1 effect depends on z2:**

n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2 + z1*z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)

## very significant pvalue, correct analysis
coxph(Surv(y,del)~z1+z2+z1*z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2 + z1 * z2)
## 
##         coef exp(coef) se(coef)     z        p
## z1    0.9157    2.4986   0.1041 8.793  < 2e-16
## z2    0.9534    2.5945   0.1051 9.072  < 2e-16
## z1:z2 1.1381    3.1209   0.1481 7.683 1.56e-14
## 
## Likelihood ratio test=608  on 3 df, p=< 2.2e-16
## n= 1000, number of events= 860
## leaving out interaction is suboptimal, but not too bad
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
## 
##       coef exp(coef) se(coef)     z      p
## z1 1.47664   4.37822  0.08043 18.36 <2e-16
## z2 1.52088   4.57626  0.08064 18.86 <2e-16
## 
## Likelihood ratio test=550  on 2 df, p=< 2.2e-16
## n= 1000, number of events= 860

**Qualitative Interaction - Direction of z1 effect depends on z2:**

n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2 - 3*z1*z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)

## very significant pvalue, correct analysis
coxph(Surv(y,del)~z1+z2+z1*z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2 + z1 * z2)
## 
##           coef exp(coef) se(coef)       z      p
## z1     0.96668   2.62920  0.10591   9.127 <2e-16
## z2     1.11276   3.04275  0.10718  10.382 <2e-16
## z1:z2 -3.15940   0.04245  0.17448 -18.107 <2e-16
## 
## Likelihood ratio test=429.9  on 3 df, p=< 2.2e-16
## n= 1000, number of events= 712
## leaving out the interaction, we estimate only an average of effects,
## which is negative here because the -3*z1*z2 term outweighs
## the positive main effects
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
## 
##       coef exp(coef) se(coef)      z        p
## z1 -0.3974    0.6721   0.0789 -5.037 4.73e-07
## z2 -0.2927    0.7463   0.0789 -3.709 0.000208
## 
## Likelihood ratio test=55.04  on 2 df, p=1.12e-12
## n= 1000, number of events= 712

Summary

  • It is possible to over-control; e.g., controlling for a mediator is usually a bad idea.
  • Randomized experiments disable any arrows into \(Z_1\), thus eliminating confounders.
  • The concepts of confounder and adjusting for a covariate are usually linked with causality, even when this is not stated explicitly.
  • The above examples provide some causal context for Section 8.7 of the textbook. See Pearl 2009 “Causality” for the whole story.
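
The second bullet can be checked in simulation. The sketch below (my own illustration, not from the text) re-runs the confounding example with \(Z_1\) randomized, disabling the arrow \(Z_2 \rightarrow Z_1\):

```r
## Sketch: the confounding simulation above, but with z1 randomized
## rather than generated from z2. With the z2 -> z1 arrow disabled,
## the unadjusted z1 coefficient is near its true value of 0.
library(survival)
set.seed(1)

n <- 10000
z2 <- rnorm(n)
z1 <- rnorm(n)              ## randomized: independent of z2
x <- rexp(n, rate = exp(z2))
y <- pmin(1, x)             ## observations censored at time 1
del <- 1*(x < 1)            ## event indicator

fit <- coxph(Surv(y, del) ~ z1)
coef(fit)                   ## near 0, even without adjusting for z2
```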

Book Approach to Model Building

The book advocates building the model by sequentially adding covariates one at a time, choosing either the covariate with the smallest p-value from a local test or the covariate producing the lowest AIC. The procedure terminates when no remaining covariate meets the p-value threshold (usually 0.05) or decreases the AIC. This approach is known as forward model selection. When one predictor is of primary interest, say \(Z_1\), the book recommends entering this predictor into the model first.
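
The AIC version of this procedure can be sketched in a short loop. This is my own illustration on simulated data (not the book's bmt analysis); only z1 and z2 truly affect the hazard, and z3 is noise:

```r
## Sketch of forward selection by AIC on simulated data.
library(survival)
set.seed(2)

n <- 500
dat <- data.frame(z1 = rnorm(n), z2 = rnorm(n), z3 = rnorm(n))
x <- rexp(n, rate = exp(dat$z1 + dat$z2))
dat$y <- pmin(1, x)       ## observations censored at time 1
dat$del <- 1*(x < 1)      ## event indicator

## AIC = -2*loglik + 2*(number of coefficients)
aic <- function(f) {
  m <- coxph(as.formula(f), data = dat)
  -2*m$loglik[length(m$loglik)] + 2*length(coef(m))
}

candidates <- c("z1", "z2", "z3")
selected <- character(0)
best <- aic("Surv(y, del) ~ 1")
repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  aics <- sapply(remaining, function(v)
    aic(paste("Surv(y, del) ~", paste(c(selected, v), collapse = " + "))))
  if (min(aics) >= best) break   ## no covariate lowers the AIC: stop
  best <- min(aics)
  selected <- c(selected, remaining[which.min(aics)])
}
selected   ## z1 and z2 enter; z3 enters only if it lowers the AIC by chance
```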

Opinion: When the goal is to understand the causal impact of some variable on \(Y\), I find this approach weak because it does not operate in a formal causal framework, so the interpretation of the final model coefficients is unclear. When the goal is building a predictive model, the approach is more justifiable: the model predictions, in the form of a survival curve for each set of covariates, are of primary interest (rather than the model coefficients). There are many more modern methods, such as penalized regression, that will probably work better than the forward selection approach discussed here. We will study some of those methods later in the course.

Assessing Causal Relationships: Bone Marrow Transplant Data

Goals and Exploratory Data Analysis

We would like to understand the association between disease type and survival. Does disease type itself affect survival, or are there confounders which cause both?

data(bmt)
?bmt

table(bmt$group)
## 
##  1  2  3 
## 38 54 45
disease <- c("ALL","AML Low","AML High")
bmt$disease <- as.factor(disease[bmt$group])
table(bmt$disease)
## 
##      ALL AML High  AML Low 
##       38       45       54
fit <- survfit(Surv(t2,d3)~disease,data=bmt)

par(mar=c(5,5,1,1))
plot(fit,xlab="Time",lty=1:3,col=1:3)
legend("topright",names(fit$strata),lty=1:3,col=1:3)

We should discuss potential confounders with doctors and biologists. Graphs can help generate theories; for example, consider patient age and wait time to transplant as potential confounders.

ggplot(bmt, aes(x=z1, fill=disease)) +
  geom_density(alpha=0.3) + xlab("Age of Patient")

ggplot(bmt, aes(x=z7, fill=disease)) +
  geom_density(alpha=0.3) + xlab("Wait Time to Transplant") + 
  xlim(0,500)
## Warning: Removed 17 rows containing non-finite values (stat_density).

Note the relationship between the survival curves and the Wait Time to Transplant densities. Perhaps

\[ \text{Disease } \rightarrow \text{ Transplant Wait Time } \rightarrow \text{Survival} \]

  • The Disease \(\rightarrow\) Transplant Wait Time relationship may be the result of algorithms which decide who gets a transplant first. These algorithms may favor AML Low over ALL over AML High.
  • The Transplant Wait Time \(\rightarrow\) Survival relationship may be a result of cancer progression while waiting for transplant: the longer you wait, the more the cancer grows, decreasing survival probability.

Supposing this is the case, Disease does have a causal effect on Survival, but through wait times rather than intrinsic biology. Future research should then focus on reducing wait times, not on understanding the biology of the diseases.

The above paragraphs are speculation, but this sort of thinking is useful for generating possible causal relations and the appropriate resulting policy actions. The data can be used to assess the degree of support for these ideas.

Forward Model Selection (Example 8.5)

We now illustrate forward selection with p-values and AIC. Fit the base model with the primary covariate of interest, disease.

?bmt
fit <- coxph(Surv(t2,d3)~disease,data=bmt)
fit
## Call:
## coxph(formula = Surv(t2, d3) ~ disease, data = bmt)
## 
##                    coef exp(coef) se(coef)      z      p
## diseaseAML High  0.3834    1.4673   0.2674  1.434 0.1516
## diseaseAML Low  -0.5742    0.5632   0.2873 -1.999 0.0457
## 
## Likelihood ratio test=13.45  on 2 df, p=0.001199
## n= 137, number of events= 83
fit$wald.test
## [1] 13.03152
pchisq(fit$wald.test,df=2,lower.tail=FALSE)
## [1] 0.001479933
aic_base <- -2*fit$loglik[2] + 2*length(fit$coef)
aic_base
## [1] 737.1393

Now consider adding each variable to the model. First we consider waiting time to transplant:

fit <- coxph(Surv(t2,d3)~disease+z7,data=bmt,method="breslow")
fit
## Call:
## coxph(formula = Surv(t2, d3) ~ disease + z7, data = bmt, method = "breslow")
## 
##                       coef  exp(coef)   se(coef)      z     p
## diseaseAML High  0.3150426  1.3703177  0.2712213  1.162 0.245
## diseaseAML Low  -0.6969556  0.4980994  0.3020352 -2.308 0.021
## z7              -0.0003924  0.9996077  0.0003610 -1.087 0.277
## 
## Likelihood ratio test=14.77  on 3 df, p=0.002022
## n= 137, number of events= 83
fit$wald.test
## [1] 14.30763
## compute p-value for local Wald test of z7
w <- fit$coef[3]^2/fit$var[3,3]
pchisq(w,df=1,lower.tail=FALSE)
##        z7 
## 0.2770364
## compute AIC
-2*fit$loglik[2] + 2*length(fit$coef)
## [1] 737.947
aic_base
## [1] 737.1393

If we repeat this analysis for all covariates we get the following table:

Table 8.6


The textbook advocates adding the covariate with the lowest p-value or lowest AIC to the model, in this case FAB. This process is repeated on the remaining covariates, with disease type and FAB in the model, until no covariate passes the p-value threshold of 0.05 or decreases the AIC.

Predictive Modeling

In Example 8.6 the book also discusses forward selection when the primary goal is prediction of the distribution of time to weaning: we are mainly interested in finding factors which contribute to that distribution, not in any single covariate's causal effect.

The p-value and AIC criteria are proxies for the improvement in predictive quality from adding a covariate. We will discuss other measures of predictive performance after Spring Break; those methods will enable comparison of more general models.
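
As a preview of such measures, one common score is Harrell's concordance index (C-index): the proportion of comparable pairs whose survival times the model orders correctly. A minimal sketch on simulated data (my own illustration; variable names are made up), using survival's concordance function:

```r
## Sketch: scoring predictive performance with a train/test split
## and Harrell's C-index on held-out data.
library(survival)
set.seed(3)

n <- 1000
dat <- data.frame(z1 = rnorm(n), z2 = rnorm(n))
x <- rexp(n, rate = exp(dat$z1))   ## only z1 truly affects the hazard
dat$y <- pmin(1, x)                ## observations censored at time 1
dat$del <- 1*(x < 1)               ## event indicator

train <- dat[1:500, ]
test  <- dat[501:1000, ]

fit <- coxph(Surv(y, del) ~ z1 + z2, data = train)
test$lp <- predict(fit, newdata = test, type = "lp")  ## risk scores

## reverse = TRUE because higher risk should mean shorter survival
cindex <- concordance(Surv(y, del) ~ lp, data = test,
                      reverse = TRUE)$concordance
cindex   ## well above 0.5 (random guessing), since z1 is truly predictive
```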

Generating Survival Curves

Once we have estimated the parametric component of the Cox PH model with \(\widehat{\beta}\), we can estimate the nonparametric baseline cumulative hazard component as

\[ \widehat{\Lambda}_0(t) = \sum_{t_i \leq t} \frac{d_i}{\sum_{j \in R(t_i)} \exp(\widehat{\beta}^T Z_j)} \]

The resulting baseline survival function is \(\widehat{S}_0(t) = \exp(-\widehat{\Lambda}_0(t))\) and the survival function for an individual with covariates \(Z\) is

\[ \widehat{S}(t|Z) = \widehat{S}_0(t)^{\exp(\widehat{\beta}^T Z)} \]

The observation-specific survival functions can be obtained by running survfit on a coxph object in R.

fit <- coxph(Surv(t2,d3)~disease,data=bmt,method="breslow")
fit
## Call:
## coxph(formula = Surv(t2, d3) ~ disease, data = bmt, method = "breslow")
## 
##                    coef exp(coef) se(coef)      z      p
## diseaseAML High  0.3826    1.4661   0.2674  1.431 0.1524
## diseaseAML Low  -0.5742    0.5632   0.2873 -1.999 0.0457
## 
## Likelihood ratio test=13.43  on 2 df, p=0.001212
## n= 137, number of events= 83
## predict survival curve for observations 1 and 100
bmt_sub <- bmt[c(1,100),]
out <- survfit(fit,newdata=bmt_sub)
str(out)
## List of 14
##  $ n        : int 137
##  $ time     : num [1:130] 1 2 10 16 32 35 47 48 53 55 ...
##  $ n.risk   : num [1:130] 137 136 135 134 133 132 131 129 127 126 ...
##  $ n.event  : num [1:130] 1 1 1 1 1 1 2 2 1 1 ...
##  $ n.censor : num [1:130] 0 0 0 0 0 0 0 0 0 0 ...
##  $ surv     : num [1:130, 1:2] 0.993 0.985 0.978 0.97 0.963 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "1" "100"
##  $ type     : chr "right"
##  $ cumhaz   : num [1:130, 1:2] 0.00744 0.01494 0.02252 0.03013 0.03783 ...
##  $ std.err  : num [1:130, 1:2] 0.00755 0.01089 0.01359 0.01597 0.01818 ...
##  $ lower    : num [1:130, 1:2] 0.978 0.964 0.952 0.94 0.929 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "1" "100"
##  $ upper    : num [1:130, 1:2] 1 1 1 1 0.998 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "1" "100"
##  $ conf.type: chr "log"
##  $ conf.int : num 0.95
##  $ call     : language survfit(formula = fit, newdata = bmt_sub)
##  - attr(*, "class")= chr [1:2] "survfit.cox" "survfit"
plot(out,col=1:2)
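
The Breslow estimator in the formula above can also be computed by hand and checked against basehaz. A sketch on simulated data (my own check, not part of the bmt analysis):

```r
## Check the Breslow baseline cumulative hazard formula by hand
## against basehaz(fit, centered = FALSE).
library(survival)
set.seed(4)

n <- 200
z <- rnorm(n)
x <- rexp(n, rate = exp(z))
y <- pmin(1, x)           ## observations censored at time 1
del <- 1*(x < 1)          ## event indicator

fit <- coxph(Surv(y, del) ~ z, method = "breslow")
risk <- exp(coef(fit) * z)

## event times t_i and increments d_i / sum_{j in R(t_i)} exp(beta' z_j)
tt  <- sort(unique(y[del == 1]))
inc <- sapply(tt, function(t) sum(y == t & del == 1) / sum(risk[y >= t]))
H0  <- cumsum(inc)

bh <- basehaz(fit, centered = FALSE)   ## baseline cumulative hazard at z = 0
max(abs(H0 - bh$hazard[bh$time %in% tt]))   ## essentially zero
```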

Optional Exercises / Work