Solutions to Homework 5

Possible Solution

There are many possible solutions, here is one.

Introduction

Our goal is to use gene expression and single nucleotide variant data to predict survival outcomes for patients with Renal Cell Carcinoma, the KIRC data set in TCGA.

We first load the data, remove genes with 0 expression, and plot a Kaplan-Meier survival curve for all of the data.

setwd("~/Desktop/survival/docs/hw")
library(survival)
library(ggplot2)
library(ggfortify)
library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-16

load("tcga_kirc.RData")
## to keep things fast, we only use N rna expressions
##N <- 1000
##rna <- rna[,1:N]
rna <- rna[,colSums(rna)!=0]

## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]

## transform rna to log scale, add 1 to avoid NAs
rna <- log10(rna + 1)

x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)
fit <- survfit(Surv(event_time,obs_death)~1,data=dat)
autoplot(fit,xlab="Survival Time (days)",ylab="Survival",ylim=c(0,1))

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

Building Models

set.seed(02042019) ## for reproducibility
ix <- sample(1:nrow(x),size=floor(nrow(x)/2))
y_tr <- y[ix]
x_tr <- x[ix,]
y_te <- y[-ix]
x_te <- x[-ix,]

We first fit two models without any screening of covariates. For the first model we use LASSO (\(\lambda=1\)) and for the second model we use an even mixture of LASSO and ridge penalties by setting \(\alpha=0.5\).

cnames <- c("No Screen, alpha=1","No Screen, alpha=0.5",
                     "Screen, alpha=1","Screen, alpha=0.5","small 10 pvalue")
preds <- matrix(0,ncol=length(cnames),nrow=length(y_te))
colnames(preds) <- cnames
cvfit_tr <- vector("list",ncol(preds))

cvfit_tr[[1]] <- cv.glmnet(x_tr,y_tr,family="cox")
preds[,1] <- predict(cvfit_tr[[1]],x_te,s="lambda.min")[,1]

cvfit_tr[[2]] <- cv.glmnet(x_tr,y_tr,family="cox",alpha=0.5)
preds[,2] <- predict(cvfit_tr[[2]],x_te,s="lambda.min")

We now perform univariate screen and fit a LASSO (\(\lambda=1\)) and an even mixture of LASSO and ridge penalties by setting \(\alpha=0.5\) for **a subset of screened variables. We also consider the simple strategy of fitting a unregularized survival model to the predictors with smallest 10 p-values.

ps <- rep(1,ncol(x))
names(ps) <- colnames(x)
for(ii in 1:ncol(rna)){
  fit <- suppressWarnings(coxph(y~x[,ii]))
  ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
summary(ps)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.004245 0.091394 0.248655 0.440189 1.000000

hist(ps)

## only use predictors with very small p-values
touse <- ps < 0.0001
mean(touse)

## [1] 0.1045365

## subset of predictors we will use
x_tr_sub <- x_tr[,touse]
x_te_sub <- x_te[,touse]

cvfit_tr[[3]] <- cv.glmnet(x_tr_sub,y_tr,family="cox")
preds[,3] <- predict(cvfit_tr[[3]],x_te_sub,s="lambda.min")

cvfit_tr[[4]] <- cv.glmnet(x_tr_sub,y_tr,family="cox",alpha=0.5)
preds[,4] <- predict(cvfit_tr[[4]],x_te_sub,s="lambda.min")

touse <- ps <= ps[order(ps)][10]
x_tr_10 <- x_tr[,touse]
x_te_10 <- x_te[,touse]
fit <- coxph(y_tr~x_tr_10)
preds[,5] <- colSums(t(x_te_10)*coef(fit))

Concordance Index Model Assessment

We compute the C-index for each model. They perform similarly. Surprising that simply chosing the coefficients with the 10 smallest p-values works so well.

library(survC1)

cind <- rep(0,ncol(preds))
for(ii in 1:ncol(preds)){
  mydata <- data.frame(as.matrix(y_te),preds[,ii])
  out <- Est.Cval(mydata, 2000, nofit=TRUE)
  cind[ii] <- out$Dhat
}
names(cind) <- colnames(preds)

cind

##   No Screen, alpha=1 No Screen, alpha=0.5      Screen, alpha=1 
##            0.6561115            0.6434531            0.6582485 
##    Screen, alpha=0.5      small 10 pvalue 
##            0.6608595            0.6723809

Log Rank Test Model Assessment

We can also assess each model’s performance by splitting the test observations into three groups based on risk score, plotting Kaplan-Meier curves for each group, and computing p-values with log rank tests.

## split observations into three groups
levs <- preds
for(ii in 1:ncol(preds)){
  levs[,ii] <- cut_number(preds[,ii],3)
}

plot_store <- vector("list",ncol(levs))

for(ii in 1:ncol(levs)){
  fit <- survfit(y_te~levs[,ii])
  out <- survdiff(y_te~levs[,ii])
  p.val <- 1 - pchisq(out$chisq, length(out$n) - 1)
  plot_store[[ii]] <- autoplot(fit,
                               xlab="Survival Time (days)",
                               ylab="Survival",
                               main=paste0(colnames(levs)[ii]," p-value: ",round(p.val,6)))
}



library(gtable)
library(grid)
library(gridExtra)

#grid.arrange(plot_store[[1]],plot_store[[2]],
#             plot_store[[3]],plot_store[[4]],
#             nrow=2,ncol=2)

plot_store[[1]]

plot_store[[2]]

plot_store[[3]]

plot_store[[4]]

plot_store[[5]]

Model Comparison

We can compare the linear predictions made by each model. They are in close agreement with the exception of the p-value method.

pairs(preds)

Model Complexity

The LASSO models (\(\alpha=1\)) have fewer non-zero coefficients and will be simpler to interpret than the models with \(\alpha=0.5\) because attention can be focused on a smaller number of gene expressions / single nucleotide varients. This result, in combination with the p-value result, leads us to favor the two \(\alpha=1\) models over the \(\alpha=0.5\) models. However extremely simple strategy of aggressive prescreening (only using top 10 covariates based on training set p-value screen) actually works best because it has highest C-index, smaller p-value from the log rank test, and only 10 coefficients to interpret.

num_nonzero <- rep(10,ncol(preds))
names(num_nonzero) <- colnames(preds)
for(ii in 1:4){
  co_mat <- as.matrix(coef(cvfit_tr[[ii]],s="lambda.min"))
  num_nonzero[ii] <- sum(co_mat[,1]!=0)
}
num_nonzero

##   No Screen, alpha=1 No Screen, alpha=0.5      Screen, alpha=1 
##                   13                   25                   14 
##    Screen, alpha=0.5      small 10 pvalue 
##                   38                   10