Please submit code along with solutions.

Question 1 (10 points)

Using the KIRC TCGA data set we discussed in class:

The report should be about 2 written pages and include a few plots / tables. Begin the report with an introductory paragraph about what you are trying to accomplish.

Possible Solution

There are many possible solutions, here is one.


Our goal is to use gene expression and single nucleotide variant data to predict survival outcomes for patients with Renal Cell Carcinoma, the KIRC data set in TCGA.

We first load the data, remove genes with 0 expression, and plot a Kaplan-Meier survival curve for all of the data.

## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
## to keep things fast, we only use N rna expressions
##N <- 1000
##rna <- rna[,1:N]
rna <- rna[,colSums(rna)!=0]

## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]

## transform rna to log scale, add 1 to avoid NAs
rna <- log10(rna + 1)

x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)
fit <- survfit(Surv(event_time,obs_death)~1,data=dat)
autoplot(fit,xlab="Survival Time (days)",ylab="Survival",ylim=c(0,1))
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

Building Models

set.seed(02042019) ## for reproducibility
ix <- sample(1:nrow(x),size=floor(nrow(x)/2))
y_tr <- y[ix]
x_tr <- x[ix,]
y_te <- y[-ix]
x_te <- x[-ix,]

We first fit two models without any screening of covariates. For the first model we use LASSO (\(\lambda=1\)) and for the second model we use an even mixture of LASSO and ridge penalties by setting \(\alpha=0.5\).

cnames <- c("No Screen, alpha=1","No Screen, alpha=0.5",
                     "Screen, alpha=1","Screen, alpha=0.5","small 10 pvalue")
preds <- matrix(0,ncol=length(cnames),nrow=length(y_te))
colnames(preds) <- cnames
cvfit_tr <- vector("list",ncol(preds))
cvfit_tr[[1]] <- cv.glmnet(x_tr,y_tr,family="cox")
preds[,1] <- predict(cvfit_tr[[1]],x_te,s="lambda.min")[,1]
cvfit_tr[[2]] <- cv.glmnet(x_tr,y_tr,family="cox",alpha=0.5)
preds[,2] <- predict(cvfit_tr[[2]],x_te,s="lambda.min")

We now perform univariate screen and fit a LASSO (\(\lambda=1\)) and an even mixture of LASSO and ridge penalties by setting \(\alpha=0.5\) for **a subset of screened variables. We also consider the simple strategy of fitting a unregularized survival model to the predictors with smallest 10 p-values.

ps <- rep(1,ncol(x))
names(ps) <- colnames(x)
for(ii in 1:ncol(rna)){
  fit <- suppressWarnings(coxph(y~x[,ii]))
  ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.004245 0.091394 0.248655 0.440189 1.000000

## only use predictors with very small p-values
touse <- ps < 0.0001
## [1] 0.1045365
## subset of predictors we will use
x_tr_sub <- x_tr[,touse]
x_te_sub <- x_te[,touse]
cvfit_tr[[3]] <- cv.glmnet(x_tr_sub,y_tr,family="cox")
preds[,3] <- predict(cvfit_tr[[3]],x_te_sub,s="lambda.min")
cvfit_tr[[4]] <- cv.glmnet(x_tr_sub,y_tr,family="cox",alpha=0.5)
preds[,4] <- predict(cvfit_tr[[4]],x_te_sub,s="lambda.min")
touse <- ps <= ps[order(ps)][10]
x_tr_10 <- x_tr[,touse]
x_te_10 <- x_te[,touse]
fit <- coxph(y_tr~x_tr_10)
preds[,5] <- colSums(t(x_te_10)*coef(fit))

Concordance Index Model Assessment

We compute the C-index for each model. They perform similarly. Surprising that simply chosing the coefficients with the 10 smallest p-values works so well.


cind <- rep(0,ncol(preds))
for(ii in 1:ncol(preds)){
  mydata <- data.frame(as.matrix(y_te),preds[,ii])
  out <- Est.Cval(mydata, 2000, nofit=TRUE)
  cind[ii] <- out$Dhat
names(cind) <- colnames(preds)

##   No Screen, alpha=1 No Screen, alpha=0.5      Screen, alpha=1 
##            0.6561115            0.6434531            0.6582485 
##    Screen, alpha=0.5      small 10 pvalue 
##            0.6608595            0.6723809

Log Rank Test Model Assessment

We can also assess each model’s performance by splitting the test observations into three groups based on risk score, plotting Kaplan-Meier curves for each group, and computing p-values with log rank tests.

## split observations into three groups
levs <- preds
for(ii in 1:ncol(preds)){
  levs[,ii] <- cut_number(preds[,ii],3)

plot_store <- vector("list",ncol(levs))

for(ii in 1:ncol(levs)){
  fit <- survfit(y_te~levs[,ii])
  out <- survdiff(y_te~levs[,ii])
  p.val <- 1 - pchisq(out$chisq, length(out$n) - 1)
  plot_store[[ii]] <- autoplot(fit,
                               xlab="Survival Time (days)",
                               main=paste0(colnames(levs)[ii]," p-value: ",round(p.val,6)))


#             plot_store[[3]],plot_store[[4]],
#             nrow=2,ncol=2)