Load and Clean Data

We use the cleaned TCGA KIRC data available at the course website.

load("tcga_kirc.RData")
ls()
## [1] "dat" "rna" "snv"
dim(rna)
## [1]   378 20531
dim(snv)
## [1] 378  44
dim(dat)
## [1] 378  61
## event_time (death or censoring time) and obs_death (censoring variable)
## are of most interest
tail(colnames(dat))
## [1] "days_to_hiv_diagnosis" "ajcc_clinical_stage"   "days_to_new_event"    
## [4] "hiv_positive"          "event_time"            "obs_death"
## to keep things fast, we only use N rna expressions
N <- 1000
rna <- rna[,1:N]
## remove all rna with 0 expression for all subjects
rna <- rna[,colSums(rna)!=0]
hist(log10(rna))

What Features Are Significantly Associated with Survival

Consider testing the null hypothesis of a gene \(i\) having no effect on survival. Assuming any relationship follows a Cox PH model, i.e. harzard function of \[ \lambda(t|z_j) = \lambda_0(t)e^{\beta_j z_j} \] then the hypotheses are \[ H_0: \beta_j = 0\\ H_a: \beta_j \neq 0 \] We compute p-values for all of these 997 hypotheses:

library(survival)
ps <- rep(1,ncol(rna))
names(ps) <- colnames(rna)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna)){
  fit <- coxph(y~rna[,ii])
  ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}

summary(ps)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.005286 0.139087 0.280805 0.508861 0.999517
hist(ps)

Questions:

If RNA expression were completely unrelated to survival, these p-values would follow a uniform[0,1] distribution. To confirm this, we run a simulation:

rna_sim <- matrix(rnorm(nrow(rna)*ncol(rna)),nrow=nrow(rna),ncol=ncol(rna))
colnames(rna_sim) <- colnames(rna)

ps_sim <- rep(1,ncol(rna_sim))
names(ps) <- colnames(rna_sim)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna_sim)){
  fit <- coxph(y~rna_sim[,ii])
  ps_sim[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
hist(ps_sim)

Error Rates

Consider the following hypothetical p-value distribution:

Using the p-value distribution we can:

  1. Determine False Discovery Rate for Set of Selected Genes: Suppose we decide to label all genes with p-values less than \(\tau\) as significant. What is our False Positive Rate? We have \(A + C\) significant genes, of which, \(C\) are false positives. So our False Positive Rate or False Discovery Rate is

\[ FDR(\tau) = \frac{C}{A + C} \]

  1. Determine False Discovery Rate for Particular Gene Let the density height at \(\tau\) be \(f(\tau)\). Then the False Discovery rate at \(\tau\), sometimes called the Local False Discovery Rate, is \[ LFDR(\tau) = \frac{\pi}{f(\tau)} \] This is the the probability the the gene is not associated with the response (i.e. posterior probability the null hypothesis is true). For a review of local false discovery rate and empirical bayes estimation see (Efron et al. 2001).

Beta Uniform Mixture (BUM) Model

The Beta Uniform Mixture (BUM) model has been proposed as an approximation to the p-value density (Pounds and Morris 2003). A \(Beta(\alpha,1)\) density has the form

\[ f(x|\alpha) = \alpha x^{\alpha-1} \] with support on \(x \in [0,1]\). Letting \(\lambda\) represent the component proportions, the distribution of p-values is modeled as \[ f(x|\lambda,\alpha) = \lambda \times \underbrace{1}_{\text{Unform}} + (1-\lambda)\underbrace{\alpha x^{\alpha-1}}_{\text{Beta}} \] We can estimate the \(\lambda\) and \(\alpha\) using maximum likelihood estimation (often with an EM algorithm to maximize the likelihood) and obtain parameter estimates \(\widehat{\lambda}\) and \(\widehat{\alpha}\). Let \(\pi\) denote the proportion of true null hypotheses. Then \(\pi\) is estimated as the greatest uniform component which can be removed from this density: \[ \widehat{\pi} = \widehat{\lambda} + (1-\widehat{\lambda})\widehat{\alpha} \]

Denote the estimated BUM CDF by \(F(x|\widehat{\lambda},\widehat{\alpha})\). Returning

  1. Determine False Discovery Rate for Set of Selected Genes: We can estimate \(C\) with \(\tau \widehat{\pi}\) and \(A + C\) with \(F(\tau|\widehat{\lambda},\widehat{\alpha})\), so our estimate of the FDR at \(\tau\) is

\[ \widehat{FDR}(\tau) = \frac{\tau \widehat{\pi}}{F(\tau|\widehat{\lambda},\widehat{\alpha})} \] Further if we have a specific \(FDR\) we would like to target, we can choose a \(\tau\) to acheive this threshold.

  1. Determine False Discovery Rate for Particular Gene The estimated Local False Discovery Rate, is \[ \widehat{LFDR}(\tau) = \frac{\widehat{\pi}}{f(\tau|\widehat{\lambda},\widehat{\alpha})} \] And recall the probability the alternative is true given a p-value of \(\tau\) is \(p(H_a|\tau) = 1 - \widehat{LFDR}(\tau)\).

Implementation of BUM Model

library(ClassComparison)
## Loading required package: oompaBase
out <- Bum(ps)

out@ahat
## [1] 0.2148261
out@lhat
## [1] 0.3017804
out@lhat + (1-out@lhat)*out@ahat
## [1] 0.4517762
out@pihat
## [1] 0.4517762
hist(out,breaks=20)
## Warning in hist.default(x@pvals, nclass = 100, probability = TRUE, xlab =
## xlab, : 'nclass' not used when 'breaks' is specified
abline(v=0.1,col="red",lwd=3)