We use the cleaned TCGA KIRC data available at the course website.
load("tcga_kirc.RData")
ls()
## [1] "dat" "rna" "snv"
dim(rna)
## [1] 378 20531
dim(snv)
## [1] 378 44
dim(dat)
## [1] 378 61
## event_time (death or censoring time) and obs_death (event indicator)
## are of most interest
tail(colnames(dat))
## [1] "days_to_hiv_diagnosis" "ajcc_clinical_stage" "days_to_new_event"
## [4] "hiv_positive" "event_time" "obs_death"
## to keep things fast, we only use the first N RNA expression columns
N <- 1000
rna <- rna[,1:N]
## remove all rna with 0 expression for all subjects
rna <- rna[,colSums(rna)!=0]
hist(log10(rna))
Consider testing the null hypothesis that a gene \(j\) has no effect on survival. Assuming any relationship follows a Cox PH model, i.e. a hazard function of \[ \lambda(t|z_j) = \lambda_0(t)e^{\beta_j z_j} \] then the hypotheses are \[ H_0: \beta_j = 0\\ H_a: \beta_j \neq 0 \] We compute p-values for all of these 997 hypotheses:
library(survival)
ps <- rep(1,ncol(rna))
names(ps) <- colnames(rna)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna)){
fit <- coxph(y~rna[,ii])
ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
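As a quick sanity check (my own aside, not part of the original analysis), the same Wald p-value for a single gene can be read off summary.coxph, since the single-covariate Wald chi-square statistic is the square of the reported z statistic:
## e.g. for the first gene; the two numbers below agree
fit1 <- coxph(y ~ rna[,1])
summary(fit1)$coefficients[,"Pr(>|z|)"]
pchisq(fit1$wald.test, df=1, lower.tail=FALSE)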
summary(ps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.005286 0.139087 0.280805 0.508861 0.999517
hist(ps)
Questions:
If RNA expression were completely unrelated to survival, these p-values would follow a Uniform[0,1] distribution. To check this, we run a simulation:
rna_sim <- matrix(rnorm(nrow(rna)*ncol(rna)),nrow=nrow(rna),ncol=ncol(rna))
colnames(rna_sim) <- colnames(rna)
ps_sim <- rep(1,ncol(rna_sim))
names(ps_sim) <- colnames(rna_sim)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna_sim)){
fit <- coxph(y~rna_sim[,ii])
ps_sim[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
hist(ps_sim)
Consider the following hypothetical p-value distribution:
Using the p-value distribution we can, among other things, estimate the false discovery rate (FDR) at a rejection threshold \(\tau\):
\[ FDR(\tau) = \frac{C}{A + C} \] where \(C\) is the number of true null hypotheses (no real effect) with p-values below \(\tau\) and \(A\) is the number of false null hypotheses (real effects) with p-values below \(\tau\).
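As a rough illustration of these quantities (a sketch of my own, not part of the original analysis), we can compare the number of p-values observed below a threshold \(\tau\) with the number we would expect if every null hypothesis were true:
tau <- 0.05
m <- length(ps)
observed <- sum(ps < tau)   ## corresponds to A + C
expected_null <- m*tau      ## roughly C if all nulls were true
c(observed=observed, expected_under_null=expected_null)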
The Beta Uniform Mixture (BUM) model has been proposed as an approximation to the p-value density (Pounds and Morris 2003). A \(Beta(\alpha,1)\) density has the form
\[ f(x|\alpha) = \alpha x^{\alpha-1} \] with support on \(x \in [0,1]\). Letting \(\lambda\) represent the component proportions, the distribution of p-values is modeled as \[ f(x|\lambda,\alpha) = \lambda \times \underbrace{1}_{\text{Uniform}} + (1-\lambda)\underbrace{\alpha x^{\alpha-1}}_{\text{Beta}} \] We can estimate \(\lambda\) and \(\alpha\) by maximum likelihood estimation (often with an EM algorithm to maximize the likelihood) and obtain parameter estimates \(\widehat{\lambda}\) and \(\widehat{\alpha}\). Let \(\pi\) denote the proportion of true null hypotheses. Then \(\pi\) is estimated by the largest uniform component contained in this density, which equals the fitted density evaluated at \(x = 1\): \[ \widehat{\pi} = \widehat{\lambda} + (1-\widehat{\lambda})\widehat{\alpha} \]
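Before using the packaged implementation below, here is a minimal sketch of how \(\widehat{\lambda}\) and \(\widehat{\alpha}\) could be obtained by directly maximizing the likelihood with optim (the helper names are my own; the ClassComparison package used later does this fitting for us):
## sketch: fit the BUM density f(x) = lambda + (1-lambda)*alpha*x^(alpha-1) by MLE
bum_negloglik <- function(par, p){
  lambda <- plogis(par[1])   ## constrain lambda to (0,1)
  alpha  <- plogis(par[2])   ## constrain alpha to (0,1)
  -sum(log(lambda + (1-lambda)*alpha*p^(alpha-1)))
}
p <- pmax(ps, .Machine$double.xmin)   ## guard against p-values stored as exactly 0
opt <- optim(c(0,0), bum_negloglik, p=p)
lambda_hat <- plogis(opt$par[1])
alpha_hat  <- plogis(opt$par[2])
pi_hat <- lambda_hat + (1-lambda_hat)*alpha_hat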
Denote the estimated BUM CDF by \(F(x|\widehat{\lambda},\widehat{\alpha})\). Returning to the FDR, for a rejection threshold \(\tau\) we estimate
\[ \widehat{FDR}(\tau) = \frac{\tau \widehat{\pi}}{F(\tau|\widehat{\lambda},\widehat{\alpha})} \] Further, if we have a specific \(FDR\) we would like to target, we can choose \(\tau\) to achieve this threshold.
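A small sketch of this calculation (my own helper; here \(F(\tau)=\widehat{\lambda}\tau+(1-\widehat{\lambda})\tau^{\widehat{\alpha}}\) is the fitted mixture CDF, and lambda_hat / alpha_hat could come from the optim sketch above or from the Bum fit below):
bum_fdr <- function(tau, lambda, alpha){
  pi_hat <- lambda + (1-lambda)*alpha
  F_tau  <- lambda*tau + (1-lambda)*tau^alpha
  tau*pi_hat/F_tau
}
## largest tau with estimated FDR below a 5% target
taus <- seq(1e-5, 0.2, by=1e-5)
max(taus[bum_fdr(taus, lambda_hat, alpha_hat) <= 0.05])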
library(ClassComparison)
## Loading required package: oompaBase
out <- Bum(ps)
out@ahat
## [1] 0.2148261
out@lhat
## [1] 0.3017804
out@lhat + (1-out@lhat)*out@ahat
## [1] 0.4517762
out@pihat
## [1] 0.4517762
hist(out,breaks=20)
## Warning in hist.default(x@pvals, nclass = 100, probability = TRUE, xlab =
## xlab, : 'nclass' not used when 'breaks' is specified
abline(v=0.1,col="red",lwd=3)
We can also use the function selectSignificant
to select genes which pass some posterior probability or FDR threshold.
gene_sel <- selectSignificant(out,alpha=0.99999,by="EmpiricalBayes")
colnames(rna)[gene_sel]
## [1] "ABCC10|89845" "ACOT9|23597" "ADORA1|134"
## [4] "ADORA2B|136" "AGAP4|119016" "AIM2|9447"
## [7] "AKAP6|9472" "AKAP8L|26993" "ALG3|10195"
## [10] "AMH|268" "ANAPC7|51434" "ANKRD13D|338692"
## [13] "ANKRD56|345079" "ANLN|54443" "AR|367"
gene_sel <- selectSignificant(out,alpha=0.00001,by="FDR")
colnames(rna)[gene_sel]
## [1] "AAAS|8086" "ABCC10|89845" "ACCN3|9311"
## [4] "ACOT9|23597" "ADAMTS6|11174" "ADORA1|134"
## [7] "ADORA2B|136" "AGAP4|119016" "AGAP6|414189"
## [10] "AGER|177" "AIM2|9447" "AKAP6|9472"
## [13] "AKAP8L|26993" "ALDH6A1|4329" "ALG3|10195"
## [16] "AMH|268" "AMIGO3|386724" "AMOT|154796"
## [19] "ANAPC7|51434" "ANKRD13D|338692" "ANKRD23|200539"
## [22] "ANKRD56|345079" "ANLN|54443" "AP2M1|1173"
## [25] "AR|367" "ARHGAP11A|9824" "ARHGAP39|80728"
Depending on our inference goal, the proper threshold will differ. For example, if we are prescreening genes to add to a regression model, we may accept a liberal threshold which admits many genes. The regression model itself will minimize the impact of unrelated genes by setting their coefficients to (near) 0 in the model. In contrast, if we are reporting a set of genes for publication, we may want a small set in which we have high confidence of a true effect.
Recall that the Breslow partial log-likelihood for the Cox proportional hazards model is
\[ LL_1(\beta) = \sum_{i=1}^D \left(\beta^T s_i - d_i \log \left(\sum_{j \in R(i)} \exp(\beta^T Z_j)\right)\right) \]
We showed that when \(p > n\), for any \(\beta\) there are an infinite number of \(\beta'\) such that \(LL_1(\beta) = LL_1(\beta')\). This implies that maximizers of the log partial likelihood are not unique (if one exists, infinitely many do). Alternatively, no maximizer may exist.
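As a toy illustration (a made-up example, not the lecture data): when \(p > n\) the design matrix has a nontrivial null space, so adding any null-space vector to \(\beta\) leaves every linear predictor, and hence the partial likelihood, unchanged.
## n = 5 subjects, p = 10 predictors
set.seed(1)
X <- matrix(rnorm(5*10), nrow=5, ncol=10)
beta <- rnorm(10)
v <- MASS::Null(t(X))[,1]                 ## a direction with X %*% v = 0
max(abs(X %*% beta - X %*% (beta + v)))   ## essentially 0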
To address this issue, we can regularize the optimization problem by adding a penalty term. This strategy is known by many names, including regularized, penalized, or shrinkage estimation. The typical strategy is to add a term to the likelihood which ensures a unique maximizer at some plausible \(\beta\). In many examples we prefer small \(\beta\), corresponding to small effects. Better still, we may prefer solutions where many \(\beta_j=0\), representing a sparse solution in which only a subset of the predictor variables have any influence on the outcome. This can be achieved with the elastic net penalty, a mixture of \(\ell_1\) (sometimes called lasso) and \(\ell_2\) (sometimes called ridge) regularizers (Zou and Hastie 2005). This method can be used with many models including standard linear regression.
Instead of maximizing \(LL_1\) we maximize (Simon et al. 2011)
\[ PLL(\beta) = \frac{2}{n}LL_1(\beta) - \lambda\left(\alpha \overbrace{\sum_{j=1}^p |\beta_j|}^{\ell_1 \text{ penalty}} + \frac{1}{2}(1-\alpha) \overbrace{\sum_{j=1}^p \beta_j^2}^{\ell_2 \text{ penalty}}\right) \] Notes on the penalty term:
The regularized Cox model log-likelihood has a unique maximizer because the Hessian of the penalized objective is negative definite. See https://www4.stat.ncsu.edu/~dzhang2/st745/chap7.pdf for background on the Cox partial likelihood.
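As a toy illustration of the penalty term itself (made-up coefficient values; this is not how glmnet evaluates it internally):
## alpha = 1 gives the pure lasso penalty, alpha = 0 pure ridge
enet_penalty <- function(beta, lambda, alpha){
  lambda*(alpha*sum(abs(beta)) + 0.5*(1-alpha)*sum(beta^2))
}
enet_penalty(beta=c(0.5, -0.2, 0, 0), lambda=0.1, alpha=0.95)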
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
summary(dat$event_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 524.2 1138.5 1326.9 1901.0 4537.0
## two individuals have follow-up time 0, remove them
ix <- dat$event_time!=0
sum(!ix)
## [1] 2
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]
x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)
?glmnet
fit <- glmnet(x,y,family="cox")
## regularization parameters
length(fit$lambda)
## [1] 100
head(fit$lambda)
## [1] 0.1552665 0.1482094 0.1414730 0.1350429 0.1289050 0.1230460
## what is largest coefficient
co <- coef(fit)
dim(co)
## [1] 1041 100
## rows are variables, columns are different choices of lambda
## we get a different set of coefficients for each lambda
As \(\lambda\) decreases, the coefficients grow in magnitude. For example, consider the largest absolute coefficient:
fit$lambda[1]
## [1] 0.1552665
max(abs(co[,1]))
## [1] 0
fit$lambda[50]
## [1] 0.01589201
max(abs(co[,50]))
fit$lambda[100]
## [1] 0.001552665
max(abs(co[,100]))
## [1] 11.757
Large values of \(\lambda\) produce sparse solutions which use only a small subset of all possible predictor variables:
beta <- co[,20]
length(beta)
## [1] 1041
sum(beta==0)
## [1] 1023
beta[beta!=0]
## ?|340602 ACVR1B|91 ADORA1|134 ADORA2B|136 ADRB3|155
## 1.802704e-02 3.423699e-05 4.598795e-04 1.541511e-04 2.921739e-02
## AKAP6|9472 AKAP8L|26993 AKR1B15|441282 ALDH6A1|4329 AMH|268
## -3.335915e-04 2.675669e-05 5.559236e-04 -4.634101e-05 1.739664e-02
## AMOT|154796 ANAPC7|51434 ANKRD56|345079 ANLN|54443 ANP32A|8125
## -1.658260e-05 3.536835e-04 -3.680497e-04 3.891963e-04 3.599975e-07
## ANXA3|306 AR|367 ARHGEF11|9826
## -4.046147e-05 -1.020423e-03 1.539591e-04
We can visualize the coefficient paths, i.e. how the coefficients change as the regularization parameter \(\lambda\) varies:
plot(fit)
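By default plot.glmnet plots the coefficients against the \(\ell_1\) norm of the coefficient vector; plotting against \(\log(\lambda)\) with labeled curves can be easier to read:
plot(fit, xvar="lambda", label=TRUE)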
Next lecture we will discuss:
Efron, Bradley, Robert Tibshirani, John D Storey, and Virginia Tusher. 2001. “Empirical Bayes Analysis of a Microarray Experiment.” Journal of the American Statistical Association 96 (456). Taylor & Francis: 1151–60.
Pounds, Stan, and Stephan W Morris. 2003. “Estimating the Occurrence of False Positives and False Negatives in Microarray Studies by Approximating and Partitioning the Empirical Distribution of P-Values.” Bioinformatics 19 (10). Oxford University Press: 1236–42.
Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5). NIH Public Access: 1.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2). Wiley Online Library: 301–20.