Please submit code along with solutions.

Using the KIRC TCGA data set we discussed in class:

- Perform some exploratory data analysis to clean and summarize data (e.g.Â remove genes with all 0 expression).
- Split data into training and test sets, use both the SNV and gene expression data. Use
`set.seed`

to make your results reproducible. - Build at least two Cox PH models on the training set. Could involve:
- Different values of \(\alpha\) (balance between \(\ell_1\) and \(\ell_2\) regularization)
- Prescreening versus no prescreening of predictors (use only training data for prescreening)
- Different criteria for selecting tuning parameters, e.g.Â built-in deviance measure versus Concordance index.
- Remove genes with low expression levels.

- Measure and report the performance of the models on the test set. In addition comment on model complexity as measured by number of covariates used. Which model do you prefer and for what purpose?

The report should be about 2 written pages and include a few plots / tables. Begin the report with an introductory paragraph about what you are trying to accomplish.

There are many possible solutions, here is one.

Our goal is to use gene expression and single nucleotide variant data to predict survival outcomes for patients with Renal Cell Carcinoma, the KIRC data set in TCGA.

We first load the data, remove genes with 0 expression, and plot a Kaplan-Meier survival curve for all of the data.

```
setwd("~/Desktop/survival/docs/hw")
library(survival)
library(ggplot2)
library(ggfortify)
library(glmnet)
```

`## Loading required package: Matrix`

`## Loading required package: foreach`

`## Loaded glmnet 2.0-16`

```
load("tcga_kirc.RData")
## to keep things fast, we only use N rna expressions
##N <- 1000
##rna <- rna[,1:N]
rna <- rna[,colSums(rna)!=0]
## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]
## transform rna to log scale, add 1 to avoid NAs
rna <- log10(rna + 1)
x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)
fit <- survfit(Surv(event_time,obs_death)~1,data=dat)
autoplot(fit,xlab="Survival Time (days)",ylab="Survival",ylim=c(0,1))
```

```
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
```

```
set.seed(02042019) ## for reproducibility
ix <- sample(1:nrow(x),size=floor(nrow(x)/2))
y_tr <- y[ix]
x_tr <- x[ix,]
y_te <- y[-ix]
x_te <- x[-ix,]
```

We first fit two models without any screening of covariates. For the first model we use LASSO (\(\lambda=1\)) and for the second model we use an even mixture of LASSO and ridge penalties by setting \(\alpha=0.5\).

```
cnames <- c("No Screen, alpha=1","No Screen, alpha=0.5",
"Screen, alpha=1","Screen, alpha=0.5","small 10 pvalue")
preds <- matrix(0,ncol=length(cnames),nrow=length(y_te))
colnames(preds) <- cnames
cvfit_tr <- vector("list",ncol(preds))
```

```
cvfit_tr[[1]] <- cv.glmnet(x_tr,y_tr,family="cox")
preds[,1] <- predict(cvfit_tr[[1]],x_te,s="lambda.min")[,1]
```

```
cvfit_tr[[2]] <- cv.glmnet(x_tr,y_tr,family="cox",alpha=0.5)
preds[,2] <- predict(cvfit_tr[[2]],x_te,s="lambda.min")
```

We now perform univariate screen and fit a LASSO (\(\lambda=1\)) and an even mixture of LASSO and ridge penalties by setting \(\alpha=0.5\) for **a subset of screened variables. We also consider the simple strategy of fitting a unregularized survival model to the predictors with smallest 10 p-values.

```
ps <- rep(1,ncol(x))
names(ps) <- colnames(x)
for(ii in 1:ncol(rna)){
fit <- suppressWarnings(coxph(y~x[,ii]))
ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
summary(ps)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.004245 0.091394 0.248655 0.440189 1.000000
```

`hist(ps)`

```
## only use predictors with very small p-values
touse <- ps < 0.0001
mean(touse)
```

`## [1] 0.1045365`

```
## subset of predictors we will use
x_tr_sub <- x_tr[,touse]
x_te_sub <- x_te[,touse]
```

```
cvfit_tr[[3]] <- cv.glmnet(x_tr_sub,y_tr,family="cox")
preds[,3] <- predict(cvfit_tr[[3]],x_te_sub,s="lambda.min")
```

```
cvfit_tr[[4]] <- cv.glmnet(x_tr_sub,y_tr,family="cox",alpha=0.5)
preds[,4] <- predict(cvfit_tr[[4]],x_te_sub,s="lambda.min")
```

```
touse <- ps <= ps[order(ps)][10]
x_tr_10 <- x_tr[,touse]
x_te_10 <- x_te[,touse]
fit <- coxph(y_tr~x_tr_10)
preds[,5] <- colSums(t(x_te_10)*coef(fit))
```

We compute the C-index for each model. They perform similarly. Surprising that simply chosing the coefficients with the 10 smallest p-values works so well.

```
library(survC1)
cind <- rep(0,ncol(preds))
for(ii in 1:ncol(preds)){
mydata <- data.frame(as.matrix(y_te),preds[,ii])
out <- Est.Cval(mydata, 2000, nofit=TRUE)
cind[ii] <- out$Dhat
}
names(cind) <- colnames(preds)
cind
```

```
## No Screen, alpha=1 No Screen, alpha=0.5 Screen, alpha=1
## 0.6561115 0.6434531 0.6582485
## Screen, alpha=0.5 small 10 pvalue
## 0.6608595 0.6723809
```

We can also assess each modelâ€™s performance by splitting the test observations into three groups based on risk score, plotting Kaplan-Meier curves for each group, and computing p-values with log rank tests.

```
## split observations into three groups
levs <- preds
for(ii in 1:ncol(preds)){
levs[,ii] <- cut_number(preds[,ii],3)
}
plot_store <- vector("list",ncol(levs))
for(ii in 1:ncol(levs)){
fit <- survfit(y_te~levs[,ii])
out <- survdiff(y_te~levs[,ii])
p.val <- 1 - pchisq(out$chisq, length(out$n) - 1)
plot_store[[ii]] <- autoplot(fit,
xlab="Survival Time (days)",
ylab="Survival",
main=paste0(colnames(levs)[ii]," p-value: ",round(p.val,6)))
}
library(gtable)
library(grid)
library(gridExtra)
#grid.arrange(plot_store[[1]],plot_store[[2]],
# plot_store[[3]],plot_store[[4]],
# nrow=2,ncol=2)
plot_store[[1]]
```