Model Performance Metrics

We need objective measures of model performance for two reasons:

  • Since we might wish to compare very different models (e.g. Cox PH and survival trees), we need metrics which are general.
  • Since we may need to choose between many values of the tuning parameters, we need methods which are fast.

Traditional Performance Metrics

More generally, one can choose a loss function and evaluate a model by computing its expected loss (risk).
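
For a quantitative response, the risk can be estimated by averaging the loss over held-out data. A minimal sketch for two common loss functions, assuming hypothetical vectors y (observed responses) and yhat (predictions):

## estimate expected loss (risk) on held-out data under two common losses;
## y (observed responses) and yhat (predictions) are hypothetical vectors
squared_error_risk  <- function(y, yhat) mean((y - yhat)^2)   ## MSE
absolute_error_risk <- function(y, yhat) mean(abs(y - yhat))  ## MAE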

Training Data Optimism and Cross Validation

Consider the following procedure: fit the model at each value of the tuning parameter, evaluate the loss on the same data used to fit the model, and choose the tuning parameter with the smallest loss.

Problem: the loss computed on the training data is an optimistic estimate of performance. It keeps decreasing as the model becomes more flexible, so this procedure tends to select overfit models, as the following example illustrates.

Overfitting Example

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
## simulate data
n <- 200
p <- 200
x <- matrix(rnorm(n*p),ncol=p)
## 10 nonzero predictors
beta <- matrix(c(rep(1,10),rep(0,p-10)),ncol=1)
y <- as.vector(x%*%beta + rnorm(n))
fit <- glmnet(x,y)
length(fit$lambda)
## [1] 90
head(fit$lambda)
## [1] 1.0915623 0.9945909 0.9062342 0.8257268 0.7523715 0.6855329
## predict responses at every lambda 
out <- predict(fit,newx=x)
dim(out)
## [1] 200  90
mse <- colMeans((out - y)^2)

## create new data from same model (same beta)
## and evaluate performance
xnew <- matrix(rnorm(n*p),ncol=p)
ynew <- as.vector(xnew%*%beta + rnorm(n))
out <- predict(fit,newx=xnew)
msenew <- colMeans((out - ynew)^2)
plot(log(fit$lambda),mse,xlim=log(c(0.01,.5)),ylim=c(0,4),
     lwd=2,xlab="log(Lambda)",ylab="Mean Squared Error",type='l',xaxs='i')
points(log(fit$lambda),msenew,type='l',col='red',lwd=2)
legend("topleft",c("Training Error","Test Error"),col=1:2,lty=1,lwd=2)
abline(h=1)

Cross Validation

In cross validation the model coefficients are estimated using \(K-1\) parts of the data and the model is then assessed on the left-out part. The left-out part is rotated so that a total of \(K\) models are fit for each value of the tuning parameter. The results are averaged across the \(K\) model fits and the tuning parameter with the best performance is chosen.

Image Source: https://blog.contactsunny.com/data-science/different-types-of-validations-in-machine-learning-cross-validation
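
The procedure can be written by hand in a few lines. The sketch below runs 5-fold cross validation at a single, arbitrarily chosen \(\lambda = 0.1\), reusing x and y from the simulation above; the cv.glmnet function in the next section automates the loop over the full \(\lambda\) path.

K <- 5
folds <- sample(rep(1:K, length.out = nrow(x)))
cv_err <- rep(0, K)
for (k in 1:K) {
  ## estimate coefficients using the K-1 retained parts
  train <- folds != k
  fit_k <- glmnet(x[train, ], y[train])
  ## assess on the left-out part at the chosen lambda
  pred_k <- predict(fit_k, newx = x[!train, ], s = 0.1)
  cv_err[k] <- mean((y[!train] - pred_k)^2)
}
mean(cv_err)  ## cross validation estimate of expected squared error at lambda = 0.1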

cv.glmnet function: Automated Cross Validation

?cv.glmnet
## type.measure argument controls loss function
## used in cross validation, by default deviance which
## is mean squared error for linear model
fit <- cv.glmnet(x,y)
plot(fit)

The best \(\lambda\) could be the one that minimizes the cross-validated loss. Alternatively, one may prefer a larger \(\lambda\) because it results in a simpler model. One common heuristic is to use the largest value of \(\lambda\) such that the error is within 1 standard error of the minimum.

## smaller lambda, less regularized model
fit$lambda.min
## [1] 0.08067433
## larger lambda, more regularized model
fit$lambda.1se
## [1] 0.1409807

We can then extract coefficients and make predictions on the new data at whatever \(\lambda\) we choose.

co <- coef(fit,s="lambda.min")[,1]
co[co!=0]
##  (Intercept)           V1           V2           V3           V4 
##  0.090696839  0.891696353  0.867475732  0.642077880  0.894187744 
##           V5           V6           V7           V8           V9 
##  0.775660347  0.790438370  1.011599583  0.914656913  0.829438643 
##          V10          V12          V18          V21          V23 
##  0.987413569  0.050442878 -0.007367044 -0.054472698 -0.008028036 
##          V31          V36          V43          V47          V53 
## -0.073157852  0.011927470  0.039444106 -0.028428816  0.007067304 
##          V54          V58          V59          V60          V63 
## -0.019844852 -0.084101031 -0.018005176  0.048594519 -0.037940066 
##          V64          V65          V75          V77          V79 
##  0.011299910 -0.051896826  0.009095553 -0.093304281 -0.021098696 
##          V87          V97          V98         V111         V112 
##  0.059935405  0.063929557  0.031123124  0.028977200 -0.056881374 
##         V129         V130         V134         V144         V146 
## -0.174727808 -0.059791292 -0.032612072 -0.006126300 -0.051495808 
##         V151         V153         V154         V156         V161 
## -0.002727842 -0.048631467 -0.004557928 -0.026063554  0.015530244 
##         V164         V170         V174         V175         V179 
## -0.003120284 -0.007678634 -0.042784911  0.007693365 -0.086917426 
##         V184         V186         V194         V198         V199 
## -0.110030409  0.030282243 -0.068976037 -0.016428873 -0.007936008
yhat <- predict(fit,newx=xnew,s="lambda.min")[,1]
mean((yhat-ynew)^2)
## [1] 1.345198

Literature Sources for Cross Validation

Measures of Model Performance with Survival Data

We cannot use mean squared error for assessing model performance because the response is censored: for censored observations the true survival time is unknown, so the error cannot be computed directly.

We now discuss three alternative strategies.

Method 1. Partial Log Likelihood

An alternative strategy is to compute the mean of the likelihood on the left-out set. This works for most models but fails for the Cox PH model due to the semi-parametric nature of the problem.

The paper (Houwelingen et al. 2006) proposed assessing the goodness of fit for a particular \(\lambda\) using

\[ \widehat{CV}_{i}(\lambda) = LL(\widehat{\beta}_{-i}(\lambda)) - LL_{-i}(\widehat{\beta}_{-i}(\lambda)) \] where \(\widehat{\beta}_{-i}(\lambda)\) is the parameter estimate leaving out part \(i\) of the data and \(LL_{-i}\) is the log likelihood leaving out part \(i\) of the data.

The cv.glmnet function assesses Cox model performance using this metric. See (Simon et al. 2011) and (Houwelingen et al. 2006) for details.
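
A rough sketch of this metric for a single left-out part, using coxph to evaluate the log partial likelihood at a fixed coefficient vector; the fold indices i_idx and the coefficient vector bhat estimated without that part are assumed to come from elsewhere (e.g. a glmnet fit):

library(survival)

## log partial likelihood of the Cox model at a fixed beta: with iter.max = 0
## coxph does no fitting and loglik[1] is the log partial likelihood at the
## supplied initial values
pll <- function(x, y, beta) {
  coxph(y ~ x, init = beta, control = coxph.control(iter.max = 0))$loglik[1]
}

## CV_i(lambda) = LL(beta_hat_{-i}) - LL_{-i}(beta_hat_{-i})
cv_i <- function(x, y, i_idx, bhat) {
  pll(x, y, bhat) - pll(x[-i_idx, , drop = FALSE], y[-i_idx], bhat)
}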

Method 2: Kaplan-Meier and p-values on Independent Set

After tuning the model, we need to assess its performance at the chosen value of the tuning parameters. Again, making this assessment on the data used to choose the tuning parameters leads to optimistic assessments. One strategy is to split the data into a training set and a test (or validation) set. The cross validation is done on the training set while the final model performance is measured on the validation set.

One could again assess model performance using deviance like statistics on the validation set. An alternative approach is to do the following, suggested in (Witten and Tibshirani 2010):

  • Compute the risk scores (\(z^T\widehat{\beta}\) in the proportional hazards model) on the validation set.
  • Break the risk scores into low, medium, and high (or just low and high) categories.
  • Fit Kaplan-Meier curves to the categories and compute the log-rank p-value.

The p-value provides a measure of how well the model is stratifying risk sets.

Method 3. Concordance Index

Let \(\widehat{\beta}\) be a vector of estimated coefficients and \(T_1\) and \(T_2\) a pair of death times with associated predictors \(Z_1\) and \(Z_2\). The Concordance Index measures the level of agreement between the predicted risk and the order of the death times. For observation 1 the risk of death is increasing in \(\widehat{\beta}^TZ_1\), and for observation 2 the risk of death is increasing in \(\widehat{\beta}^T Z_2\). The \(C\) index is defined as

\[ C(\widehat{\beta}) = P(\widehat{\beta}^TZ_1 > \widehat{\beta}^T Z_2 | T_1 < T_2) \] \(C=1/2\) for a useless model and \(C=1\) for a perfect model (with most data sets, not achievable).
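
To make the definition concrete, here is a sketch of the concordance for fully observed (uncensored) death times: the proportion of comparable pairs in which the observation that died earlier had the larger estimated risk score. The inputs risk (\(Z^T\widehat{\beta}\)) and time are hypothetical, and ties in the risk score count as discordant in this simple version.

concordance_uncensored <- function(risk, time) {
  concordant <- 0; comparable <- 0
  n <- length(time)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next           ## skip ties in death time
      comparable <- comparable + 1
      earlier <- if (time[i] < time[j]) i else j
      later   <- if (earlier == i) j else i
      ## the earlier death should have the larger risk score
      if (risk[earlier] > risk[later]) concordant <- concordant + 1
    }
  }
  concordant / comparable
}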

Since we do not always observe death times, we must estimate the concordance using the censored data. The survC1 package, with methodology developed in (Uno et al. 2011), provides consistent estimators of the concordance with censored data.

Demonstration on TCGA KIRC Data

load("tcga_kirc.RData")
ls()
##  [1] "beta"   "co"     "dat"    "fit"    "mse"    "msenew" "n"     
##  [8] "out"    "p"      "rna"    "snv"    "x"      "xnew"   "y"     
## [15] "yhat"   "ynew"
dim(rna)
## [1]   378 20531
dim(snv)
## [1] 378  44
dim(dat)
## [1] 378  61
## event_time (death or censoring time) and obs_death (censoring variable)
## are of most interest
tail(colnames(dat))
## [1] "days_to_hiv_diagnosis" "ajcc_clinical_stage"   "days_to_new_event"    
## [4] "hiv_positive"          "event_time"            "obs_death"
## optionally subsample to N rna expressions to keep things fast (not done here)
##N <- 1000
##rna <- rna[,1:N]
## drop genes with zero total expression across samples
rna <- rna[,colSums(rna)!=0]

library(survival)

## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
sum(!ix)
## [1] 2
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]

## transform rna to log scale; add 1 to avoid taking the log of zero
rna <- log10(rna + 1)

x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)

1. Split data into training and test.

set.seed(02042019) ## for reproducibility
ix <- sample(1:nrow(x),size=floor(nrow(x)/2))
y_tr <- y[ix]
x_tr <- x[ix,]
y_te <- y[-ix]
x_te <- x[-ix,]

2. Fit a regularized Cox model on the training data, using cross validation to select \(\lambda\)

cvfit_tr <- cv.glmnet(x_tr,y_tr,family="cox")
plot(cvfit_tr)

3. With the selected model, compute linear predictors on the test data and split them into two or three risk groups

co_mat <- as.matrix(coef(cvfit_tr,s="lambda.min"))
co_mat[co_mat[,1]!=0,]
## ADAMTS14|140766  ANKRD56|345079  C12orf32|83695 C6orf118|168090 
##     0.068916370    -0.027531687     0.377424912     0.198601589 
##     CKAP4|10970   FBXL16|146330      KCND1|3750        MX2|4600 
##     0.260782142    -0.002000004     0.205697745     0.096913992 
##    OR13A1|79290    PODNL1|79883   RILPL1|353116     SORBS2|8470 
##     0.433605372     0.020435948     0.599582159    -0.014124994 
##      WIT1|51352 
##     0.077508913
## get predictions x^Tbeta
preds <- predict(cvfit_tr,x_te,s="lambda.min")

## split into low, medium, high risk groups
library(ggplot2)
levs <- cut_number(preds,3)

4. Make Kaplan-Meier curves for the groups and compute the log-rank p-value

fit <- survfit(y_te~levs)
out <- survdiff(y_te~levs)
out
## Call:
## survdiff(formula = y_te ~ levs)
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## levs=[3.29,4.01] 63       12     16.9      1.44      2.19
## levs=(4.01,4.27] 62       11     18.4      3.01      4.78
## levs=(4.27,5.41] 63       27     14.6     10.51     15.03
## 
##  Chisq= 15.1  on 2 degrees of freedom, p= 5e-04
p.val <- 1 - pchisq(out$chisq, length(out$n) - 1)
p.val
## [1] 0.0005165814
library(ggplot2)
library(ggfortify)
autoplot(fit,xlab="Survival Time (days)",ylab="Survival",
         main=paste0("p-value: ",round(p.val,6)))

5. Assess Performance with C-index

Compute C-index on test set using methodology described in (Uno et al. 2011):

library(survC1)
mydata <- data.frame(as.matrix(y_te),preds)
out <- Est.Cval(mydata, 2000, nofit=TRUE)
cind <- out$Dhat
cind
## [1] 0.6561115

Are we doing better than random chance? We can empirically compute the null C-index distribution by generating linear predictors from a normal distribution and comparing the observed C-index to this null distribution.

N <- 1000
chats <- rep(0,N)
mydatasim <- mydata
for(ii in 1:N){
  mydatasim[,3] <- rnorm(n=nrow(mydatasim))
  out <- Est.Cval(mydatasim,2000,nofit=TRUE)
  chats[ii] <- out$Dhat
}

hist(chats)
abline(v=cind)

mean(chats>cind)
## [1] 0

Directions for Improvement

  • Variable screening prior to glmnet fit. The current dimensionality (\(p \approx 50n\)) is straining what glmnet can handle.
  • Tuning \(\alpha\), the elastic net mixing parameter that balances \(\ell_1\) and \(\ell_2\) regularization in the model.
  • Covariate transforms other than log.
  • Refit the model using coxph on the variables identified as non-zero by glmnet (a rough sketch follows below).
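
As a sketch of the last point, one could refit an unpenalized Cox model on the variables selected at lambda.min, reusing co_mat, x_tr, and y_tr from the demonstration above. Whether this refit actually improves test-set performance would still need to be checked on held-out data.

## refit an unpenalized Cox model on the glmnet-selected variables
sel <- rownames(co_mat)[co_mat[, 1] != 0]
refit <- coxph(y_tr ~ x_tr[, sel])
summary(refit)$coefficients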

Bibliography

Houwelingen, Hans C van, Tako Bruinsma, Augustinus AM Hart, Laura J van’t Veer, and Lodewyk FA Wessels. 2006. “Cross-Validated Cox Regression on Microarray Gene Expression Data.” Statistics in Medicine 25 (18). Wiley Online Library: 3201–16.

Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5). NIH Public Access: 1.

Uno, Hajime, Tianxi Cai, Michael J Pencina, Ralph B D’Agostino, and LJ Wei. 2011. “On the c-Statistics for Evaluating Overall Adequacy of Risk Prediction Procedures with Censored Survival Data.” Statistics in Medicine 30 (10). Wiley Online Library: 1105–17.

Witten, Daniela M, and Robert Tibshirani. 2010. “Survival Analysis with High-Dimensional Covariates.” Statistical Methods in Medical Research 19 (1). SAGE Publications Sage UK: London, England: 29–51.