```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = 'allow')
setwd("~/Desktop/survival/docs/lectures")
```
## Model Performance Metrics
We need objective measures of model performance for two reasons:
- Select tuning parameters, e.g. $\alpha$ and $\lambda$ in elastic net Cox PH
- Compare two models, e.g. elastic net Cox PH and Survival Tree
Since we might wish to compare very different models (Cox PH and Survival Trees), we need metrics that are general. Since we may need to choose between many values of the tuning parameters, we need methods that are fast.
## Traditional Performance Metrics
- Discrete Outcome $Y$
    - Accuracy
    - AUC
    - Brier Score (probabilistic predictions)
- Continuous Outcome $Y$
    - Mean Squared Error (MSE)
    - Mean Absolute Error (MAE)
More generally, one can choose a loss function and evaluate a model by computing its expected loss (risk).
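For example, with squared-error and absolute-error loss the empirical risk is simply the average loss over observations. A sketch with simulated responses and hypothetical predictions:

```{r risk_sketch}
## sketch: empirical risk (average loss) under two loss functions
set.seed(1)
y    <- rnorm(100)              ## observed responses
yhat <- y + rnorm(100, sd=0.5)  ## hypothetical predictions
mse <- mean((y - yhat)^2)   ## squared-error loss
mae <- mean(abs(y - yhat))  ## absolute-error loss
c(MSE=mse, MAE=mae)
```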
## Training Data Optimism and Cross Validation
Consider the following procedure:
- Fit the model on the data:
    - $Y \in \mathbb{R}^n$ is the response
    - $X \in \mathbb{R}^{n \times p}$ are the predictors
    - $\widehat{y} = \widehat{f}(x)$ is the prediction of $y$ at $x$
    - e.g. $\widehat{f}(x) = x^T\widehat{\beta}$
- Measure error using $Y$ and $\widehat{Y}$
    - $MSE(\widehat{\beta}) = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2$
Problem:
- The error rate computed on the same data used to fit the model parameters is **optimistic**
- It favors **more complex** models, a phenomenon known as **overfitting**
#### Overfitting Example
```{r glmnet_overfitting,out.width = "450px", fig.align='center'}
library(glmnet)
## simulate data
set.seed(1)  ## for reproducibility
n <- 200
p <- 200
x <- matrix(rnorm(n*p),ncol=p)
## 10 nonzero predictors
beta <- matrix(c(rep(1,10),rep(0,p-10)),ncol=1)
y <- as.vector(x%*%beta + rnorm(n))
fit <- glmnet(x,y)
length(fit$lambda)
head(fit$lambda)
## predict responses at every lambda
out <- predict(fit,newx=x)
dim(out)
mse <- colMeans((out - y)^2)
## create new data from same model (same beta)
## and evaluate performance
xnew <- matrix(rnorm(n*p),ncol=p)
ynew <- as.vector(xnew%*%beta + rnorm(n))
out <- predict(fit,newx=xnew)
msenew <- colMeans((out - ynew)^2)
plot(log(fit$lambda),mse,xlim=log(c(0.01,.5)),ylim=c(0,4),
lwd=2,xlab="log(Lambda)",ylab="Mean Squared Error",type='l',xaxs='i')
points(log(fit$lambda),msenew,type='l',col='red',lwd=2)
legend("topleft",c("Training Error","Test Error"),col=1:2,lty=1,lwd=2)
abline(h=1)
```
#### Cross Validation
In $K$-fold cross validation the model coefficients are estimated using $K-1$ parts of the data and the model is then assessed on the left-out part. The left-out part is rotated so that a total of $K$ models are fit for each value of the tuning parameter. The results are averaged across the $K$ model fits and the tuning parameter with the best performance is chosen.
```{r, out.width = "450px", fig.align='center',echo=FALSE}
knitr::include_graphics("cross_validation.png")
```
Image Source: [https://blog.contactsunny.com/data-science/different-types-of-validations-in-machine-learning-cross-validation](https://blog.contactsunny.com/data-science/different-types-of-validations-in-machine-learning-cross-validation)
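The procedure can be sketched by hand. A minimal example with a simulated linear model, where the fold assignment, per-fold fitting, and averaging are the essential steps:

```{r manual_cv_sketch}
## sketch: manual K-fold cross validation for a linear model
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2*x + rnorm(n)
K <- 5
fold <- sample(rep(1:K, length.out=n))  ## random fold assignment
cv_err <- numeric(K)
for (k in 1:K) {
  train <- data.frame(x=x[fold!=k], y=y[fold!=k])
  test  <- data.frame(x=x[fold==k], y=y[fold==k])
  fit <- lm(y ~ x, data=train)  ## fit on the K-1 retained folds
  ## assess on the held-out fold
  cv_err[k] <- mean((test$y - predict(fit, newdata=test))^2)
}
mean(cv_err)  ## cross validation estimate of test MSE
```

In practice this loop is repeated for every value of the tuning parameter and the averaged errors are compared.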
#### cv.glmnet function: Automated Cross Validation
```{r glmnet_cv}
## see ?cv.glmnet for details
## the type.measure argument controls the loss function
## used in cross validation; the default is deviance, which
## is mean squared error for the linear model
fit <- cv.glmnet(x,y)
plot(fit)
```
The best $\lambda$ could be the one that minimizes the cross validation loss. Alternatively, one may prefer a larger $\lambda$ because it results in a simpler model. A common heuristic is to use the largest value of $\lambda$ such that the error is within 1 standard error of the minimum.
```{r glmnet_cv_lambda}
## smaller lambda, less regularized model
fit$lambda.min
## larger lambda, more regularized model
fit$lambda.1se
```
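The 1 standard error heuristic can be reproduced by hand. A sketch on hypothetical cross validation output, where the `cvm` and `cvsd` vectors play the role of the mean CV error and its standard error returned by `cv.glmnet`:

```{r one_se_sketch}
## sketch: the 1-SE rule on hypothetical cross validation output
lambda <- exp(seq(log(1), log(0.01), length.out=10))  ## decreasing grid
cvm  <- c(3.0, 2.5, 2.1, 1.95, 1.8, 1.75, 1.8, 1.85, 2.0, 2.2)  ## mean CV error
cvsd <- rep(0.15, 10)                                 ## SE of CV error
i_min  <- which.min(cvm)            ## index of minimizing lambda
thresh <- cvm[i_min] + cvsd[i_min]  ## minimum error plus 1 SE
## largest (most regularizing) lambda with error within 1 SE of the minimum
lam_1se <- max(lambda[cvm <= thresh])
c(lam_min = lambda[i_min], lam_1se = lam_1se)
```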
We can then make predictions at whichever $\lambda$ we choose on the new data:
```{r glmnet_lambda_predict}
co <- coef(fit,s="lambda.min")[,1]
co[co!=0]
yhat <- predict(fit,newx=xnew,s="lambda.min")[,1]
mean((yhat-ynew)^2)
```
#### Literature Sources for Cross Validation
- [An Introduction to Statistical Learning](https://www-bcf.usc.edu/~gareth/ISL/)
- Freely available online
- Advanced undergraduate level
- See Chapter 5 for cross validation
- [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
- Freely available online
- More advanced
- See Chapter 7 for cross validation, assessing model performance
## Measures of Model Performance with Survival Data
We cannot use mean squared error for assessing model performance because:
* The Cox PH model returns a hazard function for each possible covariate vector $z$, but there is no single predicted survival time (one could summarize the distribution with its mean, but the mean is not always defined under the Cox PH model).
* We cannot compute a mean squared error for right censored observations.
We discuss three alternative strategies now.
#### Method 1. Partial Log Likelihood
An alternative strategy is to compute the mean of the log likelihood on the left-out set. This works for most models but fails for the Cox PH model due to the semi-parametric nature of the problem.
The paper [@van2006cross] proposed assessing the goodness of fit for a particular $\lambda$ using
$$
\widehat{CV}_{i}(\lambda) = LL(\widehat{\beta}_{-i}(\lambda)) - LL_{-i}(\widehat{\beta}_{-i}(\lambda))
$$
where $\widehat{\beta}_{-i}(\lambda)$ is the parameter estimate leaving out part $i$ of the data and $LL_{-i}$ is the log likelihood leaving out part $i$ of the data.
The `glmnet` function assesses model performance using this metric. See [@simon2011regularization] and [@van2006cross] for details.
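A sketch of this quantity for a single left-out part, using simulated data. An unpenalized `coxph` fit stands in for the penalized estimate $\widehat{\beta}_{-i}(\lambda)$, and `iter.max = 0` is used to evaluate the partial log likelihood at a fixed coefficient value:

```{r cv_pl_sketch}
## sketch: van Houwelingen-style CV partial log likelihood for one fold
library(survival)
set.seed(1)
n <- 200
x <- rnorm(n)
time   <- rexp(n, rate=exp(0.5*x))  ## hazard increases with x
status <- rbinom(n, 1, 0.7)         ## event indicator
fold <- sample(rep(1:5, length.out=n))
i <- 1  ## the left-out part
## estimate beta leaving out part i
fit_mi <- coxph(Surv(time,status) ~ x, subset = fold != i)
b <- coef(fit_mi)
## evaluate the partial log likelihood at fixed beta via iter.max = 0
LL_full <- coxph(Surv(time,status) ~ x, init=b,
                 control=coxph.control(iter.max=0))$loglik[2]
LL_mi   <- coxph(Surv(time,status) ~ x, init=b, subset = fold != i,
                 control=coxph.control(iter.max=0))$loglik[2]
CVi <- LL_full - LL_mi  ## contribution of part i to the CV criterion
CVi
```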
#### Method 2: Kaplan-Meier and p-values on Independent Set
After tuning the model, we need to assess its performance at the chosen value of the tuning parameters. Again, making this assessment on the data used to choose the tuning parameters leads to optimistic assessments. One strategy is to split the data into a training set and a test (or validation) set. The cross validation is done on the training set, while final model performance is measured on the validation set.
One could again assess model performance using deviance-like statistics on the validation set. An alternative approach, suggested in [@witten2010survival], is the following:
* Compute the risk scores ($z^T\widehat{\beta}$ in the proportional hazards model) on the validation set.
* Break the risk scores into low, medium, and high (or just low and high) categories.
* Fit Kaplan-Meier curves to the categories and compute a log-rank p-value comparing them.
The p-value provides a measure of how well the model is stratifying patients by risk.
#### Method 3. Concordance Index
Let $\widehat{\beta}$ be a set of estimated coefficients and $T_1$ and $T_2$ a pair of death times with associated predictors $Z_1$ and $Z_2$. The Concordance Index measures the level of agreement between the predicted risks and the order of the death times. For observation 1 the risk of death is proportional to $\widehat{\beta}^TZ_1$, and for observation 2 the risk of death is proportional to $\widehat{\beta}^T Z_2$. The $C$ index is defined as
$$
C(\widehat{\beta}) = P(\widehat{\beta}^TZ_1 > \widehat{\beta}^T Z_2 | T_1 < T_2)
$$
$C=1/2$ for a useless model and $C=1$ for a perfect model (with most data sets, not achievable).
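On fully observed (uncensored) data, $C$ can be estimated by counting concordant pairs directly. A sketch with simulated data and a single predictor:

```{r cindex_pairs_sketch}
## sketch: naive concordance estimate on uncensored data
set.seed(1)
n <- 100
z    <- rnorm(n)              ## single predictor, true beta = 1
time <- rexp(n, rate=exp(z))  ## higher risk score -> shorter survival
risk <- z                     ## predicted risk score
conc <- 0; total <- 0
for (i in 1:(n-1)) for (j in (i+1):n) {
  if (time[i] == time[j] || risk[i] == risk[j]) next  ## skip ties
  total <- total + 1
  ## concordant if the earlier death has the higher risk score
  if ((time[i] < time[j]) == (risk[i] > risk[j])) conc <- conc + 1
}
conc/total  ## well above 0.5 for an informative score
```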
Since death times are not always observed, the concordance must be estimated from the censored data. The package `survC1`, with methodology developed in [@uno2011c], provides consistent estimators of the concordance with censored data.
## Demonstration on TCGA KIRC Data
```{r load_data}
load("tcga_kirc.RData")
ls()
dim(rna)
dim(snv)
dim(dat)
## event_time (death or censoring time) and obs_death (event indicator)
## are of most interest
tail(colnames(dat))
## optionally use only the first N rna expressions to keep things fast
##N <- 1000
##rna <- rna[,1:N]
rna <- rna[,colSums(rna)!=0]
library(survival)
## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
sum(!ix)
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]
## transform rna to log scale; add 1 to avoid -Inf at zero counts
rna <- log10(rna + 1)
x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)
```
#### 1. Split data into training and test.
```{r split_data}
set.seed(02042019) ## for reproducibility
ix <- sample(1:nrow(x),size=floor(nrow(x)/2))
y_tr <- y[ix]
x_tr <- x[ix,]
y_te <- y[-ix]
x_te <- x[-ix,]
```
#### 2. Fit Regularized Cox Model on training using cross validation to select $\lambda$
```{r fit_cox}
cvfit_tr <- cv.glmnet(x_tr,y_tr,family="cox")
plot(cvfit_tr)
```
#### 3. With selected model, make linear predictors on test data, split into two or three groups
```{r predict}
co_mat <- as.matrix(coef(cvfit_tr,s="lambda.min"))
co_mat[co_mat[,1]!=0,]
## get predictions x^Tbeta
preds <- predict(cvfit_tr,x_te,s="lambda.min")
## split into low, medium, high risk groups
library(ggplot2)
levs <- cut_number(preds,3)
```
#### 4. Make Kaplan-Meier curves and Log Rank p-values for groups
```{r kaplan_meier_curves}
fit <- survfit(y_te~levs)
out <- survdiff(y_te~levs)
out
p.val <- 1 - pchisq(out$chisq, length(out$n) - 1)
p.val
library(ggplot2)
library(ggfortify)
autoplot(fit,xlab="Survival Time (days)",ylab="Survival",
main=paste0("p-value: ",round(p.val,6)))
```
#### 5. Assess Performance with C-index
Compute C-index on test set using methodology described in [@uno2011c]:
```{r c_index}
library(survC1)
mydata <- data.frame(as.matrix(y_te),preds)
out <- Est.Cval(mydata, 2000, nofit=TRUE)
cind <- out$Dhat
cind
```
Are we doing better than random chance? We can empirically compute the null C-index distribution by generating linear predictors from a normal distribution and comparing the observed C-index to this null distribution.
```{r c_index_null_dist}
N <- 1000
chats <- rep(0,N)
mydatasim <- mydata
for(ii in 1:N){
  mydatasim[,3] <- rnorm(n=nrow(mydatasim))
  out <- Est.Cval(mydatasim,2000,nofit=TRUE)
  chats[ii] <- out$Dhat
}
hist(chats)
abline(v=cind)
mean(chats>cind)
```
#### Directions for Improvement
* Variable screening prior to `glmnet` fit. The current dimensionality ($p \approx 50n$) is straining what `glmnet` can handle.
* Tuning $\alpha$ (ratio of $\ell_1$ to $\ell_2$ regularization in model).
* Covariate transforms other than log.
* Refit model using `coxph` on variables identified as non-zero by `glmnet`.
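The screening idea can be sketched on simulated data: each column gets a univariate `coxph` fit, and only the columns with the smallest p-values would be passed on to `glmnet`.

```{r screening_sketch}
## sketch: univariate Cox screening before a multivariable fit
library(survival)
set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n*p), ncol=p)
time   <- rexp(n, rate=exp(x[,1] - x[,2]))  ## only first two columns matter
status <- rbinom(n, 1, 0.8)
y <- Surv(time, status)
## p-value from a univariate Cox fit for each column
pvals <- apply(x, 2, function(xj)
  summary(coxph(y ~ xj))$coefficients[1, "Pr(>|z|)"])
keep <- order(pvals)[1:10]  ## indices of the 10 strongest predictors
keep
```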
## Bibliography
```{r makedotRfile, echo=FALSE, include=FALSE,eval=TRUE}
## why does this not work?
## automatically generates a .R file when running knitr
knitr::purl("07prediction.Rmd")
```