We use the cleaned TCGA KIRC data available at the course website.

```
load("tcga_kirc.RData")
ls()
```

`## [1] "dat" "rna" "snv"`

`dim(rna)`

`## [1] 378 20531`

`dim(snv)`

`## [1] 378 44`

`dim(dat)`

`## [1] 378 61`

```
## event_time (death or censoring time) and obs_death (censoring variable)
## are of most interest
tail(colnames(dat))
```

```
## [1] "days_to_hiv_diagnosis" "ajcc_clinical_stage" "days_to_new_event"
## [4] "hiv_positive" "event_time" "obs_death"
```

```
## to keep things fast, we only use N rna expressions
N <- 1000
rna <- rna[,1:N]
```

```
## remove all rna with 0 expression for all subjects
rna <- rna[,colSums(rna)!=0]
hist(log10(rna))
```

Consider testing the null hypothesis of a gene \(i\) having no effect on survival. Assuming any relationship follows a Cox PH model, i.e. harzard function of \[ \lambda(t|z_j) = \lambda_0(t)e^{\beta_j z_j} \] then the hypotheses are \[ H_0: \beta_j = 0\\ H_a: \beta_j \neq 0 \] We compute p-values for all of these 997 hypotheses:

```
library(survival)
ps <- rep(1,ncol(rna))
names(ps) <- colnames(rna)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna)){
fit <- coxph(y~rna[,ii])
ps[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
summary(ps)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.005286 0.139087 0.280805 0.508861 0.999517
```

`hist(ps)`

Questions:

- Which genes are truly associated with survival?
- Which genes should we include in a multivariate model?

If RNA expression were completely unrelated to survival, these p-values would follow a uniform[0,1] distribution. To confirm this, we run a simulation:

```
rna_sim <- matrix(rnorm(nrow(rna)*ncol(rna)),nrow=nrow(rna),ncol=ncol(rna))
colnames(rna_sim) <- colnames(rna)
ps_sim <- rep(1,ncol(rna_sim))
names(ps) <- colnames(rna_sim)
y <- Surv(dat$event_time,dat$obs_death)
for(ii in 1:ncol(rna_sim)){
fit <- coxph(y~rna_sim[,ii])
ps_sim[ii] <- pchisq(fit$wald.test,df=1,lower.tail=FALSE)
}
hist(ps_sim)
```

Consider the following hypothetical p-value distribution:

Using the p-value distribution we can:

**Determine False Discovery Rate for Set of Selected Genes**: Suppose we decide to label all genes with p-values less than \(\tau\) as significant. What is our False Positive Rate? We have \(A + C\) significant genes, of which, \(C\) are false positives. So our False Positive Rate or False Discovery Rate is

\[ FDR(\tau) = \frac{C}{A + C} \]

**Determine False Discovery Rate for Particular Gene**Let the density height at \(\tau\) be \(f(\tau)\). Then the False Discovery rate at \(\tau\), sometimes called the Local False Discovery Rate, is \[ LFDR(\tau) = \frac{\pi}{f(\tau)} \] This is the the probability the the gene is not associated with the response (i.e. posterior probability the null hypothesis is true). For a review of local false discovery rate and empirical bayes estimation see (Efron et al. 2001).

The Beta Uniform Mixture (BUM) model has been proposed as an approximation to the p-value density (Pounds and Morris 2003). A \(Beta(\alpha,1)\) density has the form

\[ f(x|\alpha) = \alpha x^{\alpha-1} \] with support on \(x \in [0,1]\). Letting \(\lambda\) represent the component proportions, the distribution of p-values is modeled as \[ f(x|\lambda,\alpha) = \lambda \times \underbrace{1}_{\text{Unform}} + (1-\lambda)\underbrace{\alpha x^{\alpha-1}}_{\text{Beta}} \] We can estimate the \(\lambda\) and \(\alpha\) using maximum likelihood estimation (often with an EM algorithm to maximize the likelihood) and obtain parameter estimates \(\widehat{\lambda}\) and \(\widehat{\alpha}\). Let \(\pi\) denote the proportion of true null hypotheses. Then \(\pi\) is estimated as the greatest uniform component which can be removed from this density: \[ \widehat{\pi} = \widehat{\lambda} + (1-\widehat{\lambda})\widehat{\alpha} \]

Denote the estimated BUM CDF by \(F(x|\widehat{\lambda},\widehat{\alpha})\). Returning

**Determine False Discovery Rate for Set of Selected Genes**: We can estimate \(C\) with \(\tau \widehat{\pi}\) and \(A + C\) with \(F(\tau|\widehat{\lambda},\widehat{\alpha})\), so our estimate of the FDR at \(\tau\) is

\[ \widehat{FDR}(\tau) = \frac{\tau \widehat{\pi}}{F(\tau|\widehat{\lambda},\widehat{\alpha})} \] Further if we have a specific \(FDR\) we would like to target, we can choose a \(\tau\) to acheive this threshold.

**Determine False Discovery Rate for Particular Gene**The estimated Local False Discovery Rate, is \[ \widehat{LFDR}(\tau) = \frac{\widehat{\pi}}{f(\tau|\widehat{\lambda},\widehat{\alpha})} \] And recall the probability the alternative is true given a p-value of \(\tau\) is \(p(H_a|\tau) = 1 - \widehat{LFDR}(\tau)\).

`library(ClassComparison)`

`## Loading required package: oompaBase`

```
out <- Bum(ps)
out@ahat
```

`## [1] 0.2148261`

`out@lhat`

`## [1] 0.3017804`

`out@lhat + (1-out@lhat)*out@ahat`

`## [1] 0.4517762`

`out@pihat`

`## [1] 0.4517762`

`hist(out,breaks=20)`

```
## Warning in hist.default(x@pvals, nclass = 100, probability = TRUE, xlab =
## xlab, : 'nclass' not used when 'breaks' is specified
```

`abline(v=0.1,col="red",lwd=3)`