Load and Clean Data

We use the cleaned TCGA KIRC data available at the course website.

load("tcga_kirc.RData")
ls()
## [1] "dat" "rna" "snv"
dim(rna)
## [1]   378 20531
dim(snv)
## [1] 378  44
dim(dat)
## [1] 378  61
## event_time (death or censoring time) and obs_death (event indicator:
## 1 if death was observed, 0 if censored) are of most interest
tail(colnames(dat))
## [1] "days_to_hiv_diagnosis" "ajcc_clinical_stage"   "days_to_new_event"    
## [4] "hiv_positive"          "event_time"            "obs_death"
## to keep things fast, we only use the first N RNA expression features
N <- 1000
rna <- rna[,1:N]
## remove genes with 0 expression across all subjects
rna <- rna[,colSums(rna)!=0]
hist(log10(rna))

What Features Are Significantly Associated with Survival?

Consider testing the null hypothesis that a gene \(j\) has no effect on survival. Assuming any relationship follows a Cox PH model, i.e. a hazard function of \[ \lambda(t|z_j) = \lambda_0(t)e^{\beta_j z_j} \] the hypotheses are \[ H_0: \beta_j = 0\\ H_a: \beta_j \neq 0 \] We compute p-values for all 997 of these hypotheses, one per remaining gene:

library(survival)
ps <- rep(1,ncol(rna))
names(ps) <- colnames(rna)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna)){
  fit <- coxph(y~rna[,ii])
  ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}

summary(ps)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.005286 0.139087 0.280805 0.508861 0.999517
hist(ps)

Questions:

If RNA expression were completely unrelated to survival, these p-values would follow a Uniform(0,1) distribution. To confirm this, we run a simulation in which the expression values are replaced by pure noise:

rna_sim <- matrix(rnorm(nrow(rna)*ncol(rna)),nrow=nrow(rna),ncol=ncol(rna))
colnames(rna_sim) <- colnames(rna)

ps_sim <- rep(1,ncol(rna_sim))
names(ps_sim) <- colnames(rna_sim)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna_sim)){
  fit <- coxph(y~rna_sim[,ii])
  ps_sim[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
hist(ps_sim)

Error Rates

Consider the following hypothetical p-value distribution:

Using the p-value distribution we can:

  1. Determine False Discovery Rate for a Set of Selected Genes: Suppose we decide to label all genes with p-values less than \(\tau\) as significant. Of the \(A + C\) genes called significant, \(A\) are truly associated with survival and \(C\) are true nulls, i.e. false positives. The proportion of false positives among the selected genes, the False Discovery Rate, is

\[ FDR(\tau) = \frac{C}{A + C} \]

  2. Determine False Discovery Rate for a Particular Gene: Let the density height at \(\tau\) be \(f(\tau)\) and let \(\pi\) denote the proportion of true null hypotheses (genes unrelated to survival). Then the False Discovery Rate at \(\tau\), sometimes called the Local False Discovery Rate, is \[ LFDR(\tau) = \frac{\pi}{f(\tau)} \] This is the probability that the gene is not associated with the response (i.e. the posterior probability that the null hypothesis is true). For a review of the local false discovery rate and empirical Bayes estimation see (Efron et al. 2001). A rough plug-in version of both calculations, applied to the p-values computed above, is sketched after this list.
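
As a rough illustration (not the model-based BUM fit of the next section), we can plug simple estimates into these formulas: a Storey-style estimate of \(\pi\) from the flat right tail of the p-value histogram, the empirical proportion of p-values below \(\tau\), and a kernel density estimate of \(f\). The threshold \(\tau = 0.01\) is an arbitrary illustration.

tau <- 0.01
## crude estimate of pi: p-values above 0.5 should be (mostly) from true nulls
pi_hat_rough <- mean(ps > 0.5)/0.5
## estimated FDR for the set of genes with p-values below tau
pi_hat_rough*tau/mean(ps <= tau)
## estimated local FDR at tau, using a kernel density estimate of f
f_dens <- density(ps, from=0, to=1)
pi_hat_rough/approx(f_dens$x, f_dens$y, xout=tau)$y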

Beta Uniform Mixture (BUM) Model

The Beta Uniform Mixture (BUM) model has been proposed as an approximation to the p-value density (Pounds and Morris 2003). A \(Beta(\alpha,1)\) density has the form

\[ f(x|\alpha) = \alpha x^{\alpha-1} \] with support on \(x \in [0,1]\). Letting \(\lambda\) represent the component proportions, the distribution of p-values is modeled as \[ f(x|\lambda,\alpha) = \lambda \times \underbrace{1}_{\text{Uniform}} + (1-\lambda)\underbrace{\alpha x^{\alpha-1}}_{\text{Beta}} \] We can estimate \(\lambda\) and \(\alpha\) using maximum likelihood estimation (often with an EM algorithm to maximize the likelihood) and obtain parameter estimates \(\widehat{\lambda}\) and \(\widehat{\alpha}\). Let \(\pi\) denote the proportion of true null hypotheses. Then \(\pi\) is estimated as the largest uniform component that can be removed from the fitted density, i.e. its minimum value, attained at \(x = 1\) since \(\widehat{\alpha} < 1\): \[ \widehat{\pi} = \widehat{\lambda} + (1-\widehat{\lambda})\widehat{\alpha} \]
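
As a sketch of what the maximum likelihood fit involves (the ClassComparison package used below provides a polished implementation), we can maximize the BUM log-likelihood directly with optim, parameterizing \(\lambda\) and \(\alpha\) on the logit scale so both stay in \((0,1)\). The starting values and the 1e-12 floor on the p-values are ad hoc choices for illustration.

bum_negll <- function(par){
  lam <- plogis(par[1]); alp <- plogis(par[2])  ## keep both parameters in (0,1)
  pv <- pmax(ps, 1e-12)  ## guard against p-values that underflow to exactly 0
  -sum(log(lam + (1-lam)*alp*pv^(alp-1)))
}
opt <- optim(c(0, 0), bum_negll)
lam_hat <- plogis(opt$par[1])
alp_hat <- plogis(opt$par[2])
## estimated proportion of true nulls
pi_hat <- lam_hat + (1-lam_hat)*alp_hat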

Denote the estimated BUM CDF by \(F(x|\widehat{\lambda},\widehat{\alpha})\). Returning to the two tasks above:

  1. Determine False Discovery Rate for a Set of Selected Genes: We can estimate \(C\) with \(\tau \widehat{\pi}\) and \(A + C\) with \(F(\tau|\widehat{\lambda},\widehat{\alpha})\), so our estimate of the FDR at \(\tau\) is

\[ \widehat{FDR}(\tau) = \frac{\tau \widehat{\pi}}{F(\tau|\widehat{\lambda},\widehat{\alpha})} \] Further, if we have a specific \(FDR\) we would like to target, we can choose a \(\tau\) to achieve this threshold.

  2. Determine False Discovery Rate for a Particular Gene: The estimated Local False Discovery Rate is \[ \widehat{LFDR}(\tau) = \frac{\widehat{\pi}}{f(\tau|\widehat{\lambda},\widehat{\alpha})} \] Recall the probability the alternative is true given a p-value of \(\tau\) is \(p(H_a|\tau) = 1 - \widehat{LFDR}(\tau)\). Both estimates are sketched in code after this list.
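
A minimal sketch using lam_hat, alp_hat, and pi_hat from the optim sketch above: the fitted BUM CDF and density, the resulting FDR and local FDR estimates, and a threshold \(\tau\) targeting a given FDR. The 5% target is only an illustration and assumes \(\widehat{\pi}\) exceeds the target.

F_bum <- function(x) lam_hat*x + (1-lam_hat)*x^alp_hat            ## BUM CDF
f_bum <- function(x) lam_hat + (1-lam_hat)*alp_hat*x^(alp_hat-1)  ## BUM density
FDR_hat  <- function(tau) pi_hat*tau/F_bum(tau)
LFDR_hat <- function(tau) pi_hat/f_bum(tau)
## tau achieving (approximately) a 5% FDR
uniroot(function(tau) FDR_hat(tau) - 0.05, c(1e-8, 1))$root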

Implementation of BUM Model

library(ClassComparison)
## Loading required package: oompaBase
out <- Bum(ps)

out@ahat
## [1] 0.2148261
out@lhat
## [1] 0.3017804
out@lhat + (1-out@lhat)*out@ahat
## [1] 0.4517762
out@pihat
## [1] 0.4517762
hist(out,breaks=20)
## Warning in hist.default(x@pvals, nclass = 100, probability = TRUE, xlab =
## xlab, : 'nclass' not used when 'breaks' is specified
abline(v=0.1,col="red",lwd=3)

We can also use the function selectSignificant to select genes which pass some posterior probability or FDR threshold.

gene_sel <- selectSignificant(out,alpha=0.99999,by="EmpiricalBayes")
colnames(rna)[gene_sel]
##  [1] "ABCC10|89845"    "ACOT9|23597"     "ADORA1|134"     
##  [4] "ADORA2B|136"     "AGAP4|119016"    "AIM2|9447"      
##  [7] "AKAP6|9472"      "AKAP8L|26993"    "ALG3|10195"     
## [10] "AMH|268"         "ANAPC7|51434"    "ANKRD13D|338692"
## [13] "ANKRD56|345079"  "ANLN|54443"      "AR|367"
gene_sel <- selectSignificant(out,alpha=0.00001,by="FDR")
colnames(rna)[gene_sel]
##  [1] "AAAS|8086"       "ABCC10|89845"    "ACCN3|9311"     
##  [4] "ACOT9|23597"     "ADAMTS6|11174"   "ADORA1|134"     
##  [7] "ADORA2B|136"     "AGAP4|119016"    "AGAP6|414189"   
## [10] "AGER|177"        "AIM2|9447"       "AKAP6|9472"     
## [13] "AKAP8L|26993"    "ALDH6A1|4329"    "ALG3|10195"     
## [16] "AMH|268"         "AMIGO3|386724"   "AMOT|154796"    
## [19] "ANAPC7|51434"    "ANKRD13D|338692" "ANKRD23|200539" 
## [22] "ANKRD56|345079"  "ANLN|54443"      "AP2M1|1173"     
## [25] "AR|367"          "ARHGAP11A|9824"  "ARHGAP39|80728"

Using These Results

Depending on our inference goal, the proper threshold will differ. For example, if we are prescreening genes to add to a regression model, we may accept a liberal threshold which admits many genes; the regression model itself will minimize the impact of unrelated genes by setting their coefficients to (near) 0. In contrast, if we are reporting a set of genes for publication, we may want a small set in which we have high confidence of a true effect.

Regularized Cox Proportional Hazards Model

Recall the Cox Proportional Hazards model Breslow partial likelihood is

\[ LL_1(\beta) = \sum_{i=1}^D \left(\beta^T s_i - d_i \log \left(\sum_{j \in R(i)} \exp(\beta^T Z_j)\right)\right) \]

We showed that when \(p > n\), for any \(\beta\) there are an infinite number of \(\beta'\) such that \(LL_1(\beta) = LL_1(\beta')\). This implies that maximizers of the log likelihood will not be unique (if one exists, there will be an infinite number). Alternatively, none may exist.
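
A minimal sketch of why this happens, using simulated covariates with hypothetical dimensions \(n = 20\), \(p = 50\): any direction \(v\) in the null space of the covariate matrix \(Z\) leaves every linear predictor \(\beta^T Z_j\), and hence the partial likelihood, unchanged.

set.seed(1)
n <- 20; p <- 50                        ## p > n
Z <- matrix(rnorm(n*p), n, p)
beta <- rnorm(p)
## columns (n+1):p of the complete Q factor span the null space of Z
nullsp <- qr.Q(qr(t(Z)), complete=TRUE)[, (n+1):p]
beta_prime <- beta + nullsp %*% rnorm(p - n)
max(abs(Z %*% beta - Z %*% beta_prime))  ## ~0: identical linear predictors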

To address this issue, we can regularize the optimization problem by adding a penalty term. This strategy is known by many names, including regularized, penalized, or shrinkage estimation. The typical strategy is to add a term to the likelihood which ensures a unique maximizer at some plausible \(\beta\). In many examples we prefer small \(\beta\), corresponding to small effects. Or even better, solutions where many \(\beta_j = 0\), representing a sparse solution in which only a subset of the predictor variables have any influence on the outcome. This can be achieved with the elastic net penalty, a mixture of \(\ell_1\) (sometimes called lasso) and \(\ell_2\) (sometimes called ridge) regularizers (Zou and Hastie 2005). This method can be used with many models, including standard linear regression.

Instead of maximizing \(LL_1\) we maximize (Simon et al. 2011)

\[ PLL(\beta) = \frac{2}{n}LL_1(\beta) - \lambda\left(\alpha \overbrace{\sum_{j=1}^p |\beta_j|}^{\ell_1 \text{ penalty}} + \frac{1}{2}(1-\alpha) \overbrace{\sum_{j=1}^p \beta_j^2}^{\ell_2 \text{ penalty}}\right) \] Notes on the penalty term:

The regularized Cox Model likelihood has a unique maximum because the Hessian is negative definite. See https://www4.stat.ncsu.edu/~dzhang2/st745/chap7.pdf for an explanation.

Implementation of Cox glmnet

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
summary(dat$event_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   524.2  1138.5  1326.9  1901.0  4537.0
## two individuals have censoring at time 0, remove
ix <- dat$event_time!=0
sum(!ix)
## [1] 2
dat <- dat[ix,]
rna <- rna[ix,]
snv <- snv[ix,]
x <- cbind(rna,snv)
y <- Surv(dat$event_time,dat$obs_death)

?glmnet

fit <- glmnet(x,y,family="cox")


## regularization parameters
length(fit$lambda)
## [1] 100
head(fit$lambda)
## [1] 0.1552665 0.1482094 0.1414730 0.1350429 0.1289050 0.1230460
## what is largest coefficient
co <- coef(fit)
dim(co)
## [1] 1041  100
## rows are variables, columns are different choices of lambda
## we get a different set of coefficients for each lambda
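
The fit above uses glmnet's default mixing parameter \(\alpha = 1\), i.e. a pure lasso penalty. An explicit elastic net mix of the \(\ell_1\) and \(\ell_2\) penalties can be requested through the alpha argument; the value 0.5 below is only an illustration.

fit_en <- glmnet(x, y, family="cox", alpha=0.5)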

As \(\lambda\) decreases, the magnitude of the coefficients grows. For example, consider the largest absolute coefficient at a few values of \(\lambda\):

fit$lambda[1]
## [1] 0.1552665
max(abs(co[,1]))
## [1] 0
fit$lambda[50]
## [1] 0.01589201
max(abs(co[,50]))
fit$lambda[100]
## [1] 0.001552665
max(abs(co[,100]))
## [1] 11.757

Large values of \(\lambda\) produce sparse solutions, which use only a small subset of all possible predictor variables. For example, at the 20th value of \(\lambda\):

beta <- co[,20]
length(beta)
## [1] 1041
sum(beta==0)
## [1] 1023
beta[beta!=0]
##       ?|340602      ACVR1B|91     ADORA1|134    ADORA2B|136      ADRB3|155 
##   1.802704e-02   3.423699e-05   4.598795e-04   1.541511e-04   2.921739e-02 
##     AKAP6|9472   AKAP8L|26993 AKR1B15|441282   ALDH6A1|4329        AMH|268 
##  -3.335915e-04   2.675669e-05   5.559236e-04  -4.634101e-05   1.739664e-02 
##    AMOT|154796   ANAPC7|51434 ANKRD56|345079     ANLN|54443    ANP32A|8125 
##  -1.658260e-05   3.536835e-04  -3.680497e-04   3.891963e-04   3.599975e-07 
##      ANXA3|306         AR|367  ARHGEF11|9826 
##  -4.046147e-05  -1.020423e-03   1.539591e-04

We can visualize the coefficient path, i.e. how the coefficients change as the regularization parameter \(\lambda\) changes:

plot(fit)
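
By default the coefficients are plotted against their \(\ell_1\) norm; the xvar argument of plot.glmnet puts \(\log(\lambda)\) on the horizontal axis instead:

plot(fit, xvar="lambda")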

Upcoming Topics

Next lecture we will discuss:

  • Tuning parameter selection for elastic net.
  • Concordance index for measuring model performance.
  • Homework 5.

References

Efron, Bradley, Robert Tibshirani, John D Storey, and Virginia Tusher. 2001. “Empirical Bayes Analysis of a Microarray Experiment.” Journal of the American Statistical Association 96 (456). Taylor & Francis: 1151–60.

Pounds, Stan, and Stephan W Morris. 2003. “Estimating the Occurrence of False Positives and False Negatives in Microarray Studies by Approximating and Partitioning the Empirical Distribution of P-Values.” Bioinformatics 19 (10). Oxford University Press: 1236–42.

Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5). NIH Public Access: 1.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2). Wiley Online Library: 301–20.