```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = 'allow')
setwd("~/Desktop/survival/docs/lectures")
```
## Load and Clean Data
We use the cleaned TCGA KIRC data available at the course website.
```{r load_data}
load("tcga_kirc.RData")
ls()
dim(rna)
dim(snv)
dim(dat)
## event_time (death or censoring time) and obs_death (censoring variable)
## are of most interest
tail(colnames(dat))
## to keep things fast, we only use the first N rna expression features
N <- 1000
rna <- rna[,1:N]
```
```{r clean_data, fig.align="center",out.width = "400px"}
## remove all rna with 0 expression for all subjects
rna <- rna[,colSums(rna)!=0]
hist(log10(rna))
```
## What Features Are Significantly Associated with Survival?
Consider testing the null hypothesis that gene $j$ has no effect on survival. Assuming any relationship follows a Cox PH model, i.e. a hazard function of
$$
\lambda(t|z_j) = \lambda_0(t)e^{\beta_j z_j}
$$
then the hypotheses are
$$
H_0: \beta_j = 0\\
H_a: \beta_j \neq 0
$$
We compute p-values for all of these `r ncol(rna)` hypotheses:
```{r cox_p_values, fig.align="center",out.width = "400px"}
library(survival)
ps <- rep(1,ncol(rna))
names(ps) <- colnames(rna)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna)){
  fit <- coxph(y~rna[,ii])
  ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
summary(ps)
hist(ps)
```
Questions:
* Which genes are truly associated with survival?
* Which genes should we include in a multivariate model?
If RNA expression were completely unrelated to survival, these p-values would follow a Uniform[0,1] distribution. To confirm this, we run a simulation:
```{r null_simulation,out.width = "400px",fig.align="center"}
rna_sim <- matrix(rnorm(nrow(rna)*ncol(rna)),nrow=nrow(rna),ncol=ncol(rna))
colnames(rna_sim) <- colnames(rna)
ps_sim <- rep(1,ncol(rna_sim))
names(ps_sim) <- colnames(rna_sim)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna_sim)){
  fit <- coxph(y~rna_sim[,ii])
  ps_sim[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
hist(ps_sim)
```
#### Error Rates
Consider the following hypothetical p-value distribution:
```{r bum_fig, out.width = "400px",fig.align="center",echo=FALSE,include=TRUE}
knitr::include_graphics("https://longjp.github.io/survival/lectures/bum_dist.png")
```
Using the p-value distribution we can:
1. **Determine False Discovery Rate for Set of Selected Genes**: Suppose we decide to label all genes with p-values less than $\tau$ as significant. What fraction of these discoveries are false? We have $A + C$ significant genes, of which $C$ are false positives. So our False Discovery Rate is
$$
FDR(\tau) = \frac{C}{A + C}
$$
2. **Determine False Discovery Rate for Particular Gene**: Let the density height at $\tau$ be $f(\tau)$ and let $\pi$ denote the proportion of true null hypotheses. Then the False Discovery Rate at $\tau$, sometimes called the Local False Discovery Rate, is
$$
LFDR(\tau) = \frac{\pi}{f(\tau)}
$$
This is the probability that the gene is not associated with the response (i.e. the posterior probability that the null hypothesis is true). For a review of the local false discovery rate and empirical Bayes estimation see [@efron2001empirical].
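To build intuition for these quantities, here is a toy simulation (purely illustrative; the mixture weight and Beta parameter are invented) in which the truth is known, so the realized FDR at a cutoff $\tau$ can be computed directly:

```r
set.seed(42)
m <- 10000                              ## total hypotheses
pi0 <- 0.8                              ## true proportion of nulls
is_null <- runif(m) < pi0
## null p-values are Uniform[0,1]; alternatives pile up near 0
p <- ifelse(is_null, runif(m), rbeta(m, 0.2, 1))
tau <- 0.05
sel <- p < tau                          ## the A + C "significant" genes
fdr <- sum(sel & is_null) / sum(sel)    ## realized C / (A + C)
fdr
```

In practice `is_null` is unknown, which is exactly why we need the density-based estimates described next.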
#### Beta Uniform Mixture (BUM) Model
The Beta Uniform Mixture (BUM) model has been proposed as an approximation to the p-value density [@pounds2003estimating]. A $Beta(\alpha,1)$ density has the form
$$
f(x|\alpha) = \alpha x^{\alpha-1}
$$
with support on $x \in [0,1]$. Letting $\lambda$ represent the component proportions, the distribution of p-values is modeled as
$$
f(x|\lambda,\alpha) = \lambda \times \underbrace{1}_{\text{Uniform}} + (1-\lambda)\underbrace{\alpha x^{\alpha-1}}_{\text{Beta}}
$$
We can estimate $\lambda$ and $\alpha$ using maximum likelihood estimation (often with an EM algorithm to maximize the likelihood), obtaining parameter estimates $\widehat{\lambda}$ and $\widehat{\alpha}$. Let $\pi$ denote the proportion of true null hypotheses. Then $\pi$ is estimated as the largest uniform component which can be removed from this density:
$$
\widehat{\pi} = \widehat{\lambda} + (1-\widehat{\lambda})\widehat{\alpha}
$$
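For illustration, here is a minimal sketch of the maximum likelihood fit by direct numerical optimization with `optim`, on simulated p-values with known $\lambda = 0.7$ and $\alpha = 0.3$ (the `Bum` function used below is the fully featured implementation):

```r
set.seed(1)
## simulated p-values: 70% Uniform[0,1], 30% Beta(0.3, 1)
p <- c(runif(7000), rbeta(3000, 0.3, 1))
## negative log likelihood of the BUM density; optimize on a
## transformed scale so lambda and alpha stay in (0, 1)
negll <- function(theta) {
  lambda <- plogis(theta[1]); alpha <- plogis(theta[2])
  -sum(log(lambda + (1 - lambda) * alpha * p^(alpha - 1)))
}
fit <- optim(c(0, 0), negll)
plogis(fit$par)   ## estimates of (lambda, alpha), near (0.7, 0.3)
```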
Denote the estimated BUM CDF by $F(x|\widehat{\lambda},\widehat{\alpha})$. Returning to the two tasks above:
1. **Determine False Discovery Rate for Set of Selected Genes**: We can estimate $C$ with $\tau \widehat{\pi}$ and $A + C$ with $F(\tau|\widehat{\lambda},\widehat{\alpha})$, so our estimate of the FDR at $\tau$ is
$$
\widehat{FDR}(\tau) = \frac{\tau \widehat{\pi}}{F(\tau|\widehat{\lambda},\widehat{\alpha})}
$$
Further, if we have a specific $FDR$ we would like to target, we can choose a $\tau$ to achieve this threshold.
2. **Determine False Discovery Rate for Particular Gene**: The estimated Local False Discovery Rate is
$$
\widehat{LFDR}(\tau) = \frac{\widehat{\pi}}{f(\tau|\widehat{\lambda},\widehat{\alpha})}
$$
Recall that the probability the alternative is true given a p-value of $\tau$ is $p(H_a|\tau) = 1 - \widehat{LFDR}(\tau)$.
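Both estimates have closed forms under the BUM fit, since $F(\tau|\widehat{\lambda},\widehat{\alpha}) = \widehat{\lambda}\tau + (1-\widehat{\lambda})\tau^{\widehat{\alpha}}$ and $f(\tau|\widehat{\lambda},\widehat{\alpha}) = \widehat{\lambda} + (1-\widehat{\lambda})\widehat{\alpha}\tau^{\widehat{\alpha}-1}$. A sketch with illustrative (invented) estimates:

```r
lhat <- 0.7; ahat <- 0.3                            ## illustrative BUM estimates
pihat <- lhat + (1 - lhat) * ahat                   ## estimated null proportion
tau <- 0.01
Ftau <- lhat * tau + (1 - lhat) * tau^ahat          ## BUM cdf at tau
ftau <- lhat + (1 - lhat) * ahat * tau^(ahat - 1)   ## BUM density at tau
FDRhat  <- pihat * tau / Ftau                       ## ~0.096
LFDRhat <- pihat / ftau                             ## ~0.267
c(FDR = FDRhat, LFDR = LFDRhat)
```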
#### Implementation of BUM Model
```{r bum_model_fit,out.width = "400px",fig.align="center"}
library(ClassComparison)
out <- Bum(ps)
out@ahat
out@lhat
out@lhat + (1-out@lhat)*out@ahat
out@pihat
hist(out,breaks=20)
abline(v=0.1,col="red",lwd=3)
```
We can also use the function `selectSignificant` to select genes which pass some posterior probability or FDR threshold.
```{r use_built_in}
gene_sel <- selectSignificant(out,alpha=0.99999,by="EmpiricalBayes")
colnames(rna)[gene_sel]
gene_sel <- selectSignificant(out,alpha=0.00001,by="FDR")
colnames(rna)[gene_sel]
```
#### Using These Results
Depending on our inference goal, the proper threshold will differ. For example, if we are prescreening genes to add to a regression model, we may accept a liberal threshold which admits many genes. The regression model itself will minimize the impact of unrelated genes by setting their coefficients to (near) 0. In contrast, if we are reporting a set of genes for publication, we may want a small set in which we have high confidence of a true effect.
## Regularized Cox Proportional Hazards Model
Recall that the Breslow partial log likelihood for the Cox Proportional Hazards model is
$$
LL(\beta) = \sum_{i=1}^D \left(\beta^T s_i - d_i \log \left(\sum_{j \in R(i)} \exp(\beta^T Z_j)\right)\right)
$$
We showed that when $p > n$, for any $\beta$ there are infinitely many $\beta'$ such that $LL(\beta) = LL(\beta')$. This implies that maximizers of the log likelihood are not unique (if one exists, there are infinitely many); alternatively, none may exist.
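This non-uniqueness is easy to demonstrate directly. A small sketch with a simulated $n \times p$ covariate matrix ($p > n$, not the TCGA data): adding any null-space vector to $\beta$ leaves every linear predictor $\beta^T Z_j$, and hence the partial likelihood, unchanged:

```r
set.seed(1)
n <- 5; p <- 10                     ## p > n, as in the text
Z <- matrix(rnorm(n * p), nrow = n) ## simulated covariate matrix (rows = subjects)
## a right singular vector with zero singular value lies in the null space of Z
v <- svd(Z, nv = p)$v[, p]
max(abs(Z %*% v))                   ## numerically zero
beta  <- rnorm(p)
beta2 <- beta + 3 * v               ## a different coefficient vector ...
max(abs(Z %*% beta - Z %*% beta2))  ## ... with identical linear predictors,
                                    ## so LL(beta) = LL(beta2)
```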
To address this issue, we can **regularize** the optimization problem by adding a penalty term. This strategy is known by many names, including regularized, penalized, or shrinkage estimation. The typical strategy is to add a term to the likelihood which ensures a unique maximizer at some plausible $\beta$. In many applications we prefer small $\beta$, corresponding to small effects. Better still are solutions where many $\beta_j=0$, representing a **sparse** solution in which only a subset of the predictor variables have any influence on the outcome. This can be achieved with the *elastic net* penalty, a mixture of the $\ell_1$ (sometimes called lasso) and $\ell_2$ (sometimes called ridge) regularizers [@zou2005regularization]. This method can be used with many models, including standard linear regression.
Instead of maximizing $LL$, we maximize the penalized log likelihood [@simon2011regularization]
$$
PLL(\beta) = \frac{2}{n}LL(\beta) - \lambda\left(\alpha \overbrace{\sum_{j=1}^p |\beta_j|}^{\ell_1 \text{ penalty}} + \frac{1}{2}(1-\alpha) \overbrace{\sum_{j=1}^p \beta_j^2}^{\ell_2 \text{ penalty}}\right)
$$
Notes on the penalty term:
* The penalty terms grow larger as $|\beta_j|$ grow, thus encouraging solutions near $0$.
* The tuning parameters $\lambda$ and $\alpha$ are generally chosen to optimize some metric, such as AIC, accuracy, or Concordance index. We discuss tuning parameter selection later.
* $\lambda$ controls the strength of the penalty: larger $\lambda$ yields smaller coefficients.
* $\alpha$ controls the balance of $\ell_1$ and $\ell_2$ in the penalty.
* The geometry of the $\ell_1$ penalty encourages sparse solutions which use a small number of variables. The selected variables can be unstable (very different set of selected variables if one perturbs the data), so the $\ell_2$ ridge penalty is used to stabilize the selected set.
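The penalty term itself is simple to evaluate. A sketch with an invented coefficient vector, showing how $\lambda$ scales the penalty and $\alpha$ trades off the two terms:

```r
## elastic net penalty: lambda * (alpha * l1 + 0.5 * (1 - alpha) * l2^2)
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + 0.5 * (1 - alpha) * sum(beta^2))
}
beta <- c(0.5, -1, 0, 2)
enet_penalty(beta, lambda = 0.1, alpha = 1)    ## pure lasso: 0.35
enet_penalty(beta, lambda = 0.1, alpha = 0)    ## pure ridge: 0.2625
enet_penalty(beta, lambda = 1,   alpha = 0.5)  ## stronger penalty: 3.0625
```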
The regularized Cox model likelihood has a unique maximum because the Hessian is negative definite. See [here](https://www4.stat.ncsu.edu/~dzhang2/st745/chap7.pdf) for an explanation.
#### Implementation of Cox glmnet
```{r coxnet}
library(glmnet)
summary(dat$event_time)
## two individuals have event/censoring time 0; remove them
ix <- dat$event_time!=0
sum(!ix)
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]
x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)
## see ?glmnet for argument details
fit <- glmnet(x,y,family="cox")
## regularization parameters
length(fit$lambda)
head(fit$lambda)
## what is largest coefficient
co <- coef(fit)
dim(co)
## rows are variables, columns are different choices of lambda
## we get a different set of coefficients for each lambda
```
As $\lambda$ decreases, the magnitude of the coefficients grows. For example, consider the largest absolute coefficient at several values of $\lambda$:
```{r lambda_fit_relations}
fit$lambda[1]
max(abs(co[,1]))
fit$lambda[50]
max(abs(co[,50]))
fit$lambda[100]
max(abs(co[,100]))
```
Large $\lambda$ produces sparse solutions which use only a small subset of all possible predictor variables:
```{r lambda_sparsity}
beta <- co[,20]
length(beta)
sum(beta==0)
beta[beta!=0]
```
We can visualize the path of the coefficients, how they change as the regularization parameter $\lambda$ changes:
```{r coefficient_paths,out.width = "400px",fig.align="center"}
plot(fit)
```
#### Upcoming Topics
Next lecture we will discuss:
* Tuning parameter selection for elastic net.
* Concordance index for measuring model performance.
* Homework 5.
```{r makedotRfile, echo=FALSE, include=FALSE,eval=FALSE}
## why does this not work?
## automatically generates .R file when running knitr
knitr::purl("06coxph.Rmd")
```
## References