Section 8.7 discusses two possible approaches to model building based on what the model is intended to do. The model may be used to:

1. estimate the effect of a primary covariate of interest on survival, adjusting for other covariates, or
2. predict survival for new observations.
The book mostly uses p-values and the Akaike Information Criterion (AIC) in the context of forward stepwise regression for both of these tasks. While these are useful tools, the book does not discuss:

a. the role of causal structure (confounders, mediators, interactions) in deciding which covariates belong in a model,
b. formal causal inference frameworks, or
c. modern methods for building and assessing predictive models.
I believe b. is particularly underrated in the statistics community. For example, the book discusses “adjusting for potential confounders,” but a satisfactory definition of confounding requires a formal causal inference framework.
We discuss a. and a little bit of b. here, as well as the p-value and AIC criteria used in the textbook. After Spring Break, we will discuss c. in the context of high-dimensional survival prediction.
Here we assume the primary interest is the \(Z_1\), \(Y\) relationship and \(Z_2\) is another variable we may add to the model.
**Confounder:** We suppose \(Z_2 \rightarrow Z_1\) and \(Z_2 \rightarrow Y\). There is no arrow from \(Z_1\) to \(Y\), so \(Z_1\) has no causal effect on \(Y\). Here \(Z_2\) confounds the \(Z_1\), \(Y\) relationship, and it is important to include it in the model.
library(survival) ## for Surv() and coxph()
n <- 1000
z2 <- rnorm(n)
z1 <- 2*z2 + 1 + rnorm(n)
x <- rexp(n,rate=exp(z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, not controlling for z2
coxph(Surv(y,del)~z1)
## Call:
## coxph(formula = Surv(y, del) ~ z1)
##
## coef exp(coef) se(coef) z p
## z1 0.40704 1.50237 0.02064 19.72 <2e-16
##
## Likelihood ratio test=406.5 on 1 df, p=< 2.2e-16
## n= 1000, number of events= 617
## after controlling for z2, not significant
## true coefficient on z1 is 0
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
##
## coef exp(coef) se(coef) z p
## z1 0.04227 1.04317 0.04243 0.996 0.319
## z2 0.96705 2.63018 0.09853 9.815 <2e-16
##
## Likelihood ratio test=505.2 on 2 df, p=< 2.2e-16
## n= 1000, number of events= 617
**Mediator:** We suppose \(Z_1 \rightarrow Z_2 \rightarrow Y\). Here \(Z_2\) is a mediator: the effect of \(Z_1\) on \(Y\) flows through \(Z_2\), so controlling for \(Z_2\) blocks this pathway and leads to the wrong conclusion about the total effect of \(Z_1\).
n <- 1000
z1 <- rnorm(n)
z2 <- 2*z1 + 1 + rnorm(n)
x <- rexp(n,rate=exp(z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1)
## Call:
## coxph(formula = Surv(y, del) ~ z1)
##
## coef exp(coef) se(coef) z p
## z1 1.34662 3.84440 0.04896 27.51 <2e-16
##
## Likelihood ratio test=798.9 on 1 df, p=< 2.2e-16
## n= 1000, number of events= 729
## after controlling for z2, not significant
## incorrect analysis
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
##
## coef exp(coef) se(coef) z p
## z1 0.11367 1.12038 0.07569 1.502 0.133
## z2 0.97626 2.65450 0.04350 22.440 <2e-16
##
## Likelihood ratio test=1340 on 2 df, p=< 2.2e-16
## n= 1000, number of events= 729
**Interaction:** We suppose \(Z_1 \rightarrow Y\) and \(Z_2 \rightarrow Y\), with \(Z_1\) and \(Z_2\) independent. \(Z_1\) and \(Z_2\) can interact in the model, meaning the effect of \(Z_1\) on \(Y\) depends on the value of \(Z_2\).
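Concretely, in a Cox model the interaction enters as a product term; under the specification below (the one simulated in this section) the log hazard ratio for \(Z_1\) is \(\beta_1 + \beta_3 Z_2\):
\[ \lambda(t|Z_1, Z_2) = \lambda_0(t) \exp(\beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_1 Z_2) \]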
**No Interaction Case:**
n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
##
## coef exp(coef) se(coef) z p
## z1 1.02488 2.78675 0.07204 14.23 <2e-16
## z2 0.99199 2.69661 0.07170 13.84 <2e-16
##
## Likelihood ratio test=356.5 on 2 df, p=< 2.2e-16
## n= 1000, number of events= 878
## leaving out z2 is suboptimal, but not too bad
coxph(Surv(y,del)~z1)
## Call:
## coxph(formula = Surv(y, del) ~ z1)
##
## coef exp(coef) se(coef) z p
## z1 0.89620 2.45027 0.07029 12.75 <2e-16
##
## Likelihood ratio test=163.9 on 1 df, p=< 2.2e-16
## n= 1000, number of events= 878
**Quantitative Interaction - Magnitude of z1 effect depends on z2:**
n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2 + z1*z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1+z2+z1*z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2 + z1 * z2)
##
## coef exp(coef) se(coef) z p
## z1 0.9157 2.4986 0.1041 8.793 < 2e-16
## z2 0.9534 2.5945 0.1051 9.072 < 2e-16
## z1:z2 1.1381 3.1209 0.1481 7.683 1.56e-14
##
## Likelihood ratio test=608 on 3 df, p=< 2.2e-16
## n= 1000, number of events= 860
## leaving out the interaction is suboptimal, but not too bad
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
##
## coef exp(coef) se(coef) z p
## z1 1.47664 4.37822 0.08043 18.36 <2e-16
## z2 1.52088 4.57626 0.08064 18.86 <2e-16
##
## Likelihood ratio test=550 on 2 df, p=< 2.2e-16
## n= 1000, number of events= 860
**Qualitative Interaction - Direction of z1 effect depends on z2:**
n <- 1000
z1 <- rbinom(n,size=1,prob=0.5)
z2 <- rbinom(n,size=1,prob=0.5)
x <- rexp(n,rate=exp(z1 + z2 - 3*z1*z2))
y <- pmin(1,x) ## observations censored at time 1
del <- 1*(x < 1) ## event indicator (1 = event observed, 0 = censored)
## very significant p-value, correct analysis
coxph(Surv(y,del)~z1+z2+z1*z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2 + z1 * z2)
##
## coef exp(coef) se(coef) z p
## z1 0.96668 2.62920 0.10591 9.127 <2e-16
## z2 1.11276 3.04275 0.10718 10.382 <2e-16
## z1:z2 -3.15940 0.04245 0.17448 -18.107 <2e-16
##
## Likelihood ratio test=429.9 on 3 df, p=< 2.2e-16
## n= 1000, number of events= 712
## leaving out the interaction we only get an average of effects,
## which results in a negative coefficient because the
## absolute size of the -3*z1*z2 term is larger than the z1 term
coxph(Surv(y,del)~z1+z2)
## Call:
## coxph(formula = Surv(y, del) ~ z1 + z2)
##
## coef exp(coef) se(coef) z p
## z1 -0.3974 0.6721 0.0789 -5.037 4.73e-07
## z2 -0.2927 0.7463 0.0789 -3.709 0.000208
##
## Likelihood ratio test=55.04 on 2 df, p=1.12e-12
## n= 1000, number of events= 712
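As a rough check on the sign: with \(Z_2 \sim \text{Bernoulli}(0.5)\), the \(Z_1\) log hazard ratio is \(1\) when \(Z_2 = 0\) and \(1 - 3 = -2\) when \(Z_2 = 1\), so marginally we expect something near the average \((1 + (-2))/2 = -0.5\) (heuristically, since Cox coefficients do not average exactly), consistent with the fitted \(-0.40\).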
The book advocates building the model by sequentially adding covariates one at a time, choosing either the covariate with the smallest p-value from a local test or the covariate producing the lowest AIC. The procedure terminates when no candidate covariate meets the p-value threshold (usually 0.05) or lowers the AIC. This approach is sometimes known as forward model selection. In the case where one predictor is of primary interest, say \(z_1\), the book recommends including this predictor in the model first.
Opinion: When the goal is to understand the causal impact of some variable on \(Y\), I find this approach weak because it does not operate in a formal causal framework, and the final interpretation of the model coefficients is unclear. When the goal is building a predictive model, the approach is more justifiable: in that case the model predictions, in the form of a survival curve for each set of covariates, are of primary interest (rather than the model coefficients). There are many more modern methods, such as penalized regression, that will probably work better than the forward selection approach discussed here. We will study some of those methods later in the course.
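As a preview, here is a minimal sketch of one such method, a lasso-penalized Cox fit using the glmnet package on simulated data (illustrative only; glmnet is assumed to be installed, and the data here are made up):
## lasso-penalized Cox regression on simulated data
library(glmnet)
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n*p),n,p)
t0 <- rexp(n,rate=exp(X[,1])) ## only column 1 affects the hazard
yy <- cbind(time=pmin(1,t0),status=1*(t0 < 1)) ## response format for family="cox"
cvfit <- cv.glmnet(X,yy,family="cox") ## cross-validation chooses the penalty
coef(cvfit,s="lambda.min") ## sparse coefficient estimates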
We would like to understand the association between disease type and survival. Does disease type itself causally affect survival, or are there confounders which affect both?
library(KMsurv) ## contains the bmt dataset
data(bmt)
?bmt
table(bmt$group)
##
## 1 2 3
## 38 54 45
disease <- c("ALL","AML Low","AML High")
bmt$disease <- as.factor(disease[bmt$group])
table(bmt$disease)
##
## ALL AML High AML Low
## 38 45 54
fit <- survfit(Surv(t2,d3)~disease,data=bmt)
par(mar=c(5,5,1,1))
plot(fit,xlab="Time",lty=1:3,col=1:3)
legend("topright",names(fit$strata),lty=1:3,col=1:3)
We should discuss potential confounders with doctors and biologists. Graphs can help generate theories. For example, what about patient age and wait time to transplant as potential confounders?
library(ggplot2) ## (par() settings do not affect ggplot figures, so none are set)
ggplot(bmt, aes(x=z1, fill=disease)) +
  geom_density(alpha=0.3) + xlab("Age of Patient")
ggplot(bmt, aes(x=z7, fill=disease)) +
  geom_density(alpha=0.3) + xlab("Wait Time to Transplant") +
  xlim(0,500)
## Warning: Removed 17 rows containing non-finite values (stat_density).
Note the relationship between the survival curves and the Wait Time to Transplant densities. Perhaps
\[ \text{Disease } \rightarrow \text{ Transplant Wait Time } \rightarrow \text{Survival} \]
Supposing this is the case, Disease does have a causal effect on Survival, but through wait times rather than through intrinsic biology. Future research should then focus on reducing wait times, not on understanding the biology of the diseases.
The above paragraphs are speculation, but this sort of thinking is useful for generating possible causal relations and the appropriate resulting policy actions. The data can be used to assess the degree of support for these ideas.
We now illustrate forward selection with p-values and AIC. Fit the base model with the primary covariate of interest, disease.
?bmt
fit <- coxph(Surv(t2,d3)~disease,data=bmt)
fit
## Call:
## coxph(formula = Surv(t2, d3) ~ disease, data = bmt)
##
## coef exp(coef) se(coef) z p
## diseaseAML High 0.3834 1.4673 0.2674 1.434 0.1516
## diseaseAML Low -0.5742 0.5632 0.2873 -1.999 0.0457
##
## Likelihood ratio test=13.45 on 2 df, p=0.001199
## n= 137, number of events= 83
fit$wald.test
## [1] 13.03152
pchisq(fit$wald.test,df=2,lower.tail=FALSE)
## [1] 0.001479933
aic_base <- -2*fit$loglik[2] + 2*length(fit$coef)
aic_base
## [1] 737.1393
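The same value can be obtained directly with stats::AIC, which uses the partial log-likelihood method for coxph objects:
AIC(fit) ## matches the manual computation above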
Now consider adding each variable to the model. First we consider waiting time to transplant:
fit <- coxph(Surv(t2,d3)~disease+z7,data=bmt,method="breslow")
fit
## Call:
## coxph(formula = Surv(t2, d3) ~ disease + z7, data = bmt, method = "breslow")
##
## coef exp(coef) se(coef) z p
## diseaseAML High 0.3150426 1.3703177 0.2712213 1.162 0.245
## diseaseAML Low -0.6969556 0.4980994 0.3020352 -2.308 0.021
## z7 -0.0003924 0.9996077 0.0003610 -1.087 0.277
##
## Likelihood ratio test=14.77 on 3 df, p=0.002022
## n= 137, number of events= 83
fit$wald.test
## [1] 14.30763
## compute p-value for local Wald test of the z7 coefficient
w <- fit$coef[3]^2/fit$var[3,3]
pchisq(w,df=1,lower.tail=FALSE)
## z7
## 0.2770364
## compute AIC
-2*fit$loglik[2] + 2*length(fit$coef)
## [1] 737.947
aic_base
## [1] 737.1393
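The Wald statistic above approximates the local likelihood ratio test; the exact LRT can be computed by comparing nested fits with anova (a small check, refitting both models):
## local likelihood ratio test for z7 via nested Cox models
fit0 <- coxph(Surv(t2,d3)~disease,data=bmt,method="breslow")
fit1 <- coxph(Surv(t2,d3)~disease+z7,data=bmt,method="breslow")
anova(fit0,fit1) ## LRT with 1 df for adding z7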
If we repeat this analysis for all covariates we get the results shown in Table 8.6 of the textbook.
The textbook advocates adding the covariate with the lowest p-value or lowest AIC to the model, in this case FAB. This process is repeated on the remaining covariates, with disease type and FAB in the model, until no covariate passes the p-value threshold of 0.05 or decreases the AIC. A sketch of automating this search appears below.
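A minimal sketch of the forward-AIC search (the candidate set here is a small illustrative subset of the bmt covariates, not the book's full list):
## forward selection by AIC, always keeping the primary covariate disease
candidates <- c("z1","z2","z7","z8") ## patient age, donor age, wait time, FAB
selected <- "disease"
make_form <- function(vars) as.formula(paste("Surv(t2,d3) ~",paste(vars,collapse=" + ")))
repeat {
  if(length(candidates)==0) break
  base_aic <- AIC(coxph(make_form(selected),data=bmt,method="breslow"))
  cand_aic <- sapply(candidates,function(v)
    AIC(coxph(make_form(c(selected,v)),data=bmt,method="breslow")))
  if(min(cand_aic) >= base_aic) break ## stop: no candidate lowers the AIC
  best <- names(which.min(cand_aic))
  selected <- c(selected,best)
  candidates <- setdiff(candidates,best)
}
selected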
In Example 8.6 the book also discusses forward selection when the primary goal is to find a model predictive of the distribution of time to weaning. There the interest is in finding all factors which contribute to the distribution of time to weaning, rather than in a single primary covariate.
The p-value and AIC criteria are proxies for the improvement in predictive quality when a covariate is added. We discuss other measures of the predictive performance of a model after Spring Break. These methods will enable comparison of more general models.
Once we have estimated the parametric component of the Cox PH model with \(\widehat{\beta}\), we can estimate the nonparametric baseline cumulative hazard component (the Breslow estimator) as
\[ \widehat{\Lambda}_0(t) = \sum_{t_i \leq t} \frac{d_i}{\sum_{j \in R(t_i)} \exp(\widehat{\beta}^T Z_j)} \]
The resulting baseline survival function is \(\widehat{S}_0(t) = \exp(-\widehat{\Lambda}_0(t))\) and the survival function for an individual with covariates \(Z\) is
\[ \widehat{S}(t|Z) = \widehat{S}_0(t)^{\exp(\widehat{\beta}^T Z)} \]
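As a sanity check, the Breslow estimator can be computed directly from the formula on simulated data and compared with survival::basehaz (a minimal sketch; the data and model here are illustrative):
## Breslow baseline cumulative hazard computed by hand
set.seed(1)
n <- 50
z <- rnorm(n)
x <- rexp(n,rate=exp(0.5*z))
y <- pmin(1,x)
del <- 1*(x < 1) ## event indicator
fit <- coxph(Surv(y,del)~z,method="breslow")
tt <- sort(unique(y[del==1])) ## distinct event times t_i
risk <- exp(fit$coef*z) ## exp(beta^T Z_j) for each subject j
## increment at t_i: d_i divided by the sum over the risk set R(t_i)
inc <- sapply(tt,function(ti) sum(y==ti & del==1)/sum(risk[y >= ti]))
Lambda0 <- cumsum(inc)
## agrees with the built-in (uncentered) estimate at the event times
bh <- basehaz(fit,centered=FALSE)
max(abs(Lambda0 - bh$hazard[bh$time %in% tt])) ## approximately 0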
The observation-specific survival functions can be found by running survfit on a coxph object in R.
fit <- coxph(Surv(t2,d3)~disease,data=bmt,method="breslow")
fit
## Call:
## coxph(formula = Surv(t2, d3) ~ disease, data = bmt, method = "breslow")
##
## coef exp(coef) se(coef) z p
## diseaseAML High 0.3826 1.4661 0.2674 1.431 0.1524
## diseaseAML Low -0.5742 0.5632 0.2873 -1.999 0.0457
##
## Likelihood ratio test=13.43 on 2 df, p=0.001212
## n= 137, number of events= 83
## predict survival curve for observations 1 and 100
bmt_sub <- bmt[c(1,100),]
out <- survfit(fit,newdata=bmt_sub)
str(out)
## List of 14
## $ n : int 137
## $ time : num [1:130] 1 2 10 16 32 35 47 48 53 55 ...
## $ n.risk : num [1:130] 137 136 135 134 133 132 131 129 127 126 ...
## $ n.event : num [1:130] 1 1 1 1 1 1 2 2 1 1 ...
## $ n.censor : num [1:130] 0 0 0 0 0 0 0 0 0 0 ...
## $ surv : num [1:130, 1:2] 0.993 0.985 0.978 0.97 0.963 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "1" "100"
## $ type : chr "right"
## $ cumhaz : num [1:130, 1:2] 0.00744 0.01494 0.02252 0.03013 0.03783 ...
## $ std.err : num [1:130, 1:2] 0.00755 0.01089 0.01359 0.01597 0.01818 ...
## $ lower : num [1:130, 1:2] 0.978 0.964 0.952 0.94 0.929 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "1" "100"
## $ upper : num [1:130, 1:2] 1 1 1 1 0.998 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "1" "100"
## $ conf.type: chr "log"
## $ conf.int : num 0.95
## $ call : language survfit(formula = fit, newdata = bmt_sub)
## - attr(*, "class")= chr [1:2] "survfit.cox" "survfit"
plot(out,col=1:2)
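To label the two predicted curves, a legend can be added (a small addition beyond the original code):
legend("bottomleft",c("observation 1","observation 100"),col=1:2,lty=1)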