Model Performance Metrics

We need objective measures of model performance for two reasons:

  • Since we might wish to compare very different models (e.g. Cox PH and survival trees), we need metrics which are general.
  • Since we may need to choose between many values of the tuning parameters, we need methods which are fast.

Traditional Performance Metrics

More generally, one can choose a loss function and evaluate a model by computing its expected loss (risk).
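
For a quantitative response, the risk can be estimated by averaging the loss over held-out data. A minimal sketch for two common loss functions, assuming hypothetical vectors y (observed responses) and yhat (predictions):

## estimate expected loss (risk) on held-out data under two common losses;
## y (observed responses) and yhat (predictions) are hypothetical vectors
squared_error_risk  <- function(y, yhat) mean((y - yhat)^2)   ## MSE
absolute_error_risk <- function(y, yhat) mean(abs(y - yhat))  ## MAE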

Training Data Optimism and Cross Validation

Consider the following procedure: fit the model at each value of the tuning parameter, evaluate the loss on the same data used to fit the model, and choose the tuning parameter with the smallest loss.

Problem: the loss computed on the training data is an optimistic estimate of performance. It keeps decreasing as the model becomes more flexible, so this procedure tends to select overfit models, as the following example illustrates.

Overfitting Example

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
## simulate data
n <- 200
p <- 200
x <- matrix(rnorm(n*p),ncol=p)
## 10 nonzero predictors
beta <- matrix(c(rep(1,10),rep(0,p-10)),ncol=1)
y <- as.vector(x%*%beta + rnorm(n))
fit <- glmnet(x,y)
length(fit$lambda)
## [1] 90
head(fit$lambda)
## [1] 1.0915623 0.9945909 0.9062342 0.8257268 0.7523715 0.6855329
## predict responses at every lambda 
out <- predict(fit,newx=x)
dim(out)
## [1] 200  90
mse <- colMeans((out - y)^2)

## create new data from same model (same beta)
## and evaluate performance
xnew <- matrix(rnorm(n*p),ncol=p)
ynew <- as.vector(xnew%*%beta + rnorm(n))
out <- predict(fit,newx=xnew)
msenew <- colMeans((out - ynew)^2)
plot(log(fit$lambda),mse,xlim=log(c(0.01,.5)),ylim=c(0,4),
     lwd=2,xlab="log(Lambda)",ylab="Mean Squared Error",type='l',xaxs='i')
points(log(fit$lambda),msenew,type='l',col='red',lwd=2)
legend("topleft",c("Training Error","Test Error"),col=1:2,lty=1,lwd=2)
abline(h=1)

Cross Validation

In cross validation the model coefficients are estimated using \(K-1\) parts of the data and the model is then assessed on the left-out part. The left-out part is rotated so that a total of \(K\) models are fit for each value of the tuning parameter. The results are averaged across the \(K\) model fits and the tuning parameter with the best performance is chosen.

Image Source: https://blog.contactsunny.com/data-science/different-types-of-validations-in-machine-learning-cross-validation
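
The procedure can be written by hand in a few lines. The sketch below runs 5-fold cross validation at a single, arbitrarily chosen \(\lambda = 0.1\), reusing x and y from the simulation above; the cv.glmnet function in the next section automates the loop over the full \(\lambda\) path.

K <- 5
folds <- sample(rep(1:K, length.out = nrow(x)))
cv_err <- rep(0, K)
for (k in 1:K) {
  ## estimate coefficients using the K-1 retained parts
  train <- folds != k
  fit_k <- glmnet(x[train, ], y[train])
  ## assess on the left-out part at the chosen lambda
  pred_k <- predict(fit_k, newx = x[!train, ], s = 0.1)
  cv_err[k] <- mean((y[!train] - pred_k)^2)
}
mean(cv_err)  ## cross validation estimate of expected squared error at lambda = 0.1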

cv.glmnet function: Automated Cross Validation

?cv.glmnet
## type.measure argument controls loss function
## used in cross validation, by default deviance which
## is mean squared error for linear model
fit <- cv.glmnet(x,y)
plot(fit)

The best \(\lambda\) could be the one that minimizes the cross-validated loss. Alternatively, one may prefer a larger \(\lambda\) because it results in a simpler model. One common heuristic is to use the largest value of \(\lambda\) such that the error is within 1 standard error of the minimum.

## smaller lambda, less regularized model
fit$lambda.min
## [1] 0.08067433
## larger lambda, more regularized model
fit$lambda.1se
## [1] 0.1409807

We can then extract coefficients and make predictions on the new data at whatever \(\lambda\) we choose.

co <- coef(fit,s="lambda.min")[,1]
co[co!=0]
##  (Intercept)           V1           V2           V3           V4 
##  0.090696839  0.891696353  0.867475732  0.642077880  0.894187744 
##           V5           V6           V7           V8           V9 
##  0.775660347  0.790438370  1.011599583  0.914656913  0.829438643 
##          V10          V12          V18          V21          V23 
##  0.987413569  0.050442878 -0.007367044 -0.054472698 -0.008028036 
##          V31          V36          V43          V47          V53 
## -0.073157852  0.011927470  0.039444106 -0.028428816  0.007067304 
##          V54          V58          V59          V60          V63 
## -0.019844852 -0.084101031 -0.018005176  0.048594519 -0.037940066 
##          V64          V65          V75          V77          V79 
##  0.011299910 -0.051896826  0.009095553 -0.093304281 -0.021098696 
##          V87          V97          V98         V111         V112 
##  0.059935405  0.063929557  0.031123124  0.028977200 -0.056881374 
##         V129         V130         V134         V144         V146 
## -0.174727808 -0.059791292 -0.032612072 -0.006126300 -0.051495808 
##         V151         V153         V154         V156         V161 
## -0.002727842 -0.048631467 -0.004557928 -0.026063554  0.015530244 
##         V164         V170         V174         V175         V179 
## -0.003120284 -0.007678634 -0.042784911  0.007693365 -0.086917426 
##         V184         V186         V194         V198         V199 
## -0.110030409  0.030282243 -0.068976037 -0.016428873 -0.007936008
yhat <- predict(fit,newx=xnew,s="lambda.min")[,1]
mean((yhat-ynew)^2)
## [1] 1.345198

Literature Sources for Cross Validation

Measures of Model Performance with Survival Data

We cannot use mean squared error for assessing model performance because the response is censored: for censored observations the true survival time is unknown, so the error cannot be computed directly.

We now discuss three alternative strategies.

Method 1. Partial Log Likelihood

An alternative strategy is to compute the mean of the likelihood on the left-out set. This works for most models but fails for the Cox PH model due to the semi-parametric nature of the problem.

The paper (Houwelingen et al. 2006) proposed assessing the goodness of fit for a particular \(\lambda\) using

\[ \widehat{CV}_{i}(\lambda) = LL(\widehat{\beta}_{-i}(\lambda)) - LL_{-i}(\widehat{\beta}_{-i}(\lambda)) \] where \(\widehat{\beta}_{-i}(\lambda)\) is the parameter estimate leaving out part \(i\) of the data and \(LL_{-i}\) is the log likelihood leaving out part \(i\) of the data.

The cv.glmnet function assesses Cox model performance using this metric. See (Simon et al. 2011) and (Houwelingen et al. 2006) for details.
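
A rough sketch of this metric for a single left-out part, using coxph to evaluate the log partial likelihood at a fixed coefficient vector; the fold indices i_idx and the coefficient vector bhat estimated without that part are assumed to come from elsewhere (e.g. a glmnet fit):

library(survival)

## log partial likelihood of the Cox model at a fixed beta: with iter.max = 0
## coxph does no fitting and loglik[1] is the log partial likelihood at the
## supplied initial values
pll <- function(x, y, beta) {
  coxph(y ~ x, init = beta, control = coxph.control(iter.max = 0))$loglik[1]
}

## CV_i(lambda) = LL(beta_hat_{-i}) - LL_{-i}(beta_hat_{-i})
cv_i <- function(x, y, i_idx, bhat) {
  pll(x, y, bhat) - pll(x[-i_idx, , drop = FALSE], y[-i_idx], bhat)
}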

Method 2: Kaplan-Meier and p-values on Independent Set

After tuning the model, we need to assess its performance at the chosen value of the tuning parameters. Again, making this assessment on the data used to choose the tuning parameters leads to optimistic assessments. One strategy is to split the data into a training set and a test (or validation) set. The cross validation is done on the training set while the final model performance is measured on the validation set.

One could again assess model performance using deviance like statistics on the validation set. An alternative approach is to do the following, suggested in (Witten and Tibshirani 2010):

  • Compute the risk scores (\(z^T\widehat{\beta}\) in the proportional hazards model) on the validation set.
  • Break the risk scores into low, medium, and high (or just low and high) categories.
  • Fit Kaplan-Meier curves to the categories and compute the log-rank p-value.

The p-value provides a measure of how well the model is stratifying risk sets.

Method 3. Concordance Index

Let \(\widehat{\beta}\) be a vector of estimated coefficients and \(T_1\) and \(T_2\) a pair of death times with associated predictors \(Z_1\) and \(Z_2\). The Concordance Index measures the level of agreement between the predicted risk and the order of the death times. For observation 1 the risk of death is increasing in \(\widehat{\beta}^TZ_1\), and for observation 2 the risk of death is increasing in \(\widehat{\beta}^T Z_2\). The \(C\) index is defined as

\[ C(\widehat{\beta}) = P(\widehat{\beta}^TZ_1 > \widehat{\beta}^T Z_2 | T_1 < T_2) \] \(C=1/2\) for a useless model and \(C=1\) for a perfect model (with most data sets, not achievable).
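
To make the definition concrete, here is a sketch of the concordance for fully observed (uncensored) death times: the proportion of comparable pairs in which the observation that died earlier had the larger estimated risk score. The inputs risk (\(Z^T\widehat{\beta}\)) and time are hypothetical, and ties in the risk score count as discordant in this simple version.

concordance_uncensored <- function(risk, time) {
  concordant <- 0; comparable <- 0
  n <- length(time)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next           ## skip ties in death time
      comparable <- comparable + 1
      earlier <- if (time[i] < time[j]) i else j
      later   <- if (earlier == i) j else i
      ## the earlier death should have the larger risk score
      if (risk[earlier] > risk[later]) concordant <- concordant + 1
    }
  }
  concordant / comparable
}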

Since we do not always observe death times, we must estimate the concordance using the censored data. The survC1 package, with methodology developed in (Uno et al. 2011), provides consistent estimators of the concordance with censored data.

Demonstration on TCGA KIRC Data

load("tcga_kirc.RData")
ls()
##  [1] "beta"   "co"     "dat"    "fit"    "mse"    "msenew" "n"     
##  [8] "out"    "p"      "rna"    "snv"    "x"      "xnew"   "y"     
## [15] "yhat"   "ynew"
dim(rna)
## [1]   378 20531
dim(snv)
## [1] 378  44
dim(dat)
## [1] 378  61
## event_time (death or censoring time) and obs_death (censoring variable)
## are of most interest
tail(colnames(dat))
## [1] "days_to_hiv_diagnosis" "ajcc_clinical_stage"   "days_to_new_event"    
## [4] "hiv_positive"          "event_time"            "obs_death"
## optionally subsample to N rna expressions to keep things fast (not done here)
##N <- 1000
##rna <- rna[,1:N]
## drop genes with zero total expression across samples
rna <- rna[,colSums(rna)!=0]

library(survival)

## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
sum(!ix)
## [1] 2
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]

## transform rna to log scale; add 1 to avoid taking the log of zero
rna <- log10(rna + 1)

x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)

1. Split data into training and test.

set.seed(02042019) ## for reproducibility
ix <- sample(1:nrow(x),size=floor(nrow(x)/2))
y_tr <- y[ix]
x_tr <- x[ix,]
y_te <- y[-ix]
x_te <- x[-ix,]

2. Fit a regularized Cox model on the training data, using cross validation to select \(\lambda\)

cvfit_tr <- cv.glmnet(x_tr,y_tr,family="cox")
plot(cvfit_tr)

3. With the selected model, compute linear predictors on the test data and split them into two or three risk groups

co_mat <- as.matrix(coef(cvfit_tr,s="lambda.min"))
co_mat[co_mat[,1]!=0,]
## ADAMTS14|140766  ANKRD56|345079  C12orf32|83695 C6orf118|168090 
##     0.068916370    -0.027531687     0.377424912     0.198601589 
##     CKAP4|10970   FBXL16|146330      KCND1|3750        MX2|4600 
##     0.260782142    -0.002000004     0.205697745     0.096913992 
##    OR13A1|79290    PODNL1|79883   RILPL1|353116     SORBS2|8470 
##     0.433605372     0.020435948     0.599582159    -0.014124994 
##      WIT1|51352 
##     0.077508913
## get predictions x^Tbeta
preds <- predict(cvfit_tr,x_te,s="lambda.min")

## split into low, medium, high risk groups
library(ggplot2)
levs <- cut_number(preds,3)

4. Make Kaplan-Meier curves for the groups and compute the log-rank p-value

fit <- survfit(y_te~levs)
out <- survdiff(y_te~levs)
out
## Call:
## survdiff(formula = y_te ~ levs)
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## levs=[3.29,4.01] 63       12     16.9      1.44      2.19
## levs=(4.01,4.27] 62       11     18.4      3.01      4.78
## levs=(4.27,5.41] 63       27     14.6     10.51     15.03
## 
##  Chisq= 15.1  on 2 degrees of freedom, p= 5e-04
p.val <- 1 - pchisq(out$chisq, length(out$n) - 1)
p.val
## [1] 0.0005165814
library(ggplot2)
library(ggfortify)
autoplot(fit,xlab="Survival Time (days)",ylab="Survival",
         main=paste0("p-value: ",round(p.val,6)))

5. Assess Performance with C-index

Compute C-index on test set using methodology described in (Uno et al. 2011):

library(survC1)
mydata <- data.frame(as.matrix(y_te),preds)
out <- Est.Cval(mydata, 2000, nofit=TRUE)
cind <- out$Dhat
cind
## [1] 0.6561115

Are we doing better than random chance? We can empirically compute the null C-index distribution by generating linear predictors from a normal distribution and comparing the observed C-index to this null distribution.

N <- 1000
chats <- rep(0,N)
mydatasim <- mydata
for(ii in 1:N){
  mydatasim[,3] <- rnorm(n=nrow(mydatasim))
  out <- Est.Cval(mydatasim,2000,nofit=TRUE)
  chats[ii] <- out$Dhat
}

hist(chats)
abline(v=cind)

mean(chats>cind)
## [1] 0

Directions for Improvement

  • Variable screening prior to glmnet fit. The current dimensionality (\(p \approx 50n\)) is straining what glmnet can handle.
  • Tuning \(\alpha\), the elastic net mixing parameter that balances \(\ell_1\) and \(\ell_2\) regularization in the model.
  • Covariate transforms other than log.
  • Refit the model using coxph on the variables identified as non-zero by glmnet (a rough sketch follows below).
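
As a sketch of the last point, one could refit an unpenalized Cox model on the variables selected at lambda.min, reusing co_mat, x_tr, and y_tr from the demonstration above. Whether this refit actually improves test-set performance would still need to be checked on held-out data.

## refit an unpenalized Cox model on the glmnet-selected variables
sel <- rownames(co_mat)[co_mat[, 1] != 0]
refit <- coxph(y_tr ~ x_tr[, sel])
summary(refit)$coefficients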

Bibliography

Houwelingen, Hans C van, Tako Bruinsma, Augustinus AM Hart, Laura J van’t Veer, and Lodewyk FA Wessels. 2006. “Cross-Validated Cox Regression on Microarray Gene Expression Data.” Statistics in Medicine 25 (18). Wiley Online Library: 3201–16.

Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5). NIH Public Access: 1.

Uno, Hajime, Tianxi Cai, Michael J Pencina, Ralph B D’Agostino, and LJ Wei. 2011. “On the c-Statistics for Evaluating Overall Adequacy of Risk Prediction Procedures with Censored Survival Data.” Statistics in Medicine 30 (10). Wiley Online Library: 1105–17.

Witten, Daniela M, and Robert Tibshirani. 2010. “Survival Analysis with High-Dimensional Covariates.” Statistical Methods in Medical Research 19 (1). SAGE Publications Sage UK: London, England: 29–51.