```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = 'allow')
```
```{r load_libraries}
library(survival)
```
## Data Set
We will work with a small subset of [The Cancer Genome Atlas](https://cancergenome.nih.gov/) Renal Cell Carcinoma data set.
```{r load_data}
load("kirc_small.RData")
ls()
head(kirc_small)
```
This is right censored data. `time` is time of death or censoring. `death_obs` is 1 if death observed, 0 if censored. `BAP1_mutation` indicates presence of mutation in BAP1 gene. `SLC7A11_ex` is expression of gene SLC7A11.
```{r eda, fig.width=8,fig.height=4}
nrow(kirc_small)
table(kirc_small$death_obs)
table(kirc_small$BAP1_mutation)
par(mfcol=c(1,2))
hist(kirc_small$time)
hist(log2(kirc_small$SLC7A11_ex))
```
### Kaplan-Meier Curve for Survival
Associations and possible causal relations have been discovered for the effects of BAP1 mutations and SLC7A11 on survival. We plot Kaplan Meier for BAP1 and SLC7A11 thresholded on median value.
```{r threshold}
kirc_small[,"SLC7A11_gr_median"] <- kirc_small$SLC7A11_ex > median(kirc_small$SLC7A11_ex)
```
```{r km_analysis,fig.width=8,fig.height=4}
par(mfcol=c(1,2),mar=c(5,5,1,1))
fit1 <- survfit(Surv(time,death_obs)~BAP1_mutation,data=kirc_small)
plot(fit1,xlab="Days Since Diagnosis",ylab="S(t)",col=1:2,lty=1:2,lwd=2)
legend("bottomleft",names(fit1$strata),col=1:2,lty=1:2,lwd=2)
fit2 <- survfit(Surv(time,death_obs)~SLC7A11_gr_median,data=kirc_small)
plot(fit2,xlab="Days Since Diagnosis",ylab="S(t)",col=1:2,lty=1:2,lwd=2)
legend("bottomleft",names(fit2$strata),col=1:2,lty=1:2,lwd=2)
```
```{r km_analysis_pvalues,fig.width=8,fig.height=4}
survdiff(Surv(time,death_obs)~BAP1_mutation,data=kirc_small)
survdiff(Surv(time,death_obs)~SLC7A11_gr_median,data=kirc_small)
```
Comments / Questions :
* Why cut SLC7A11 on median? More generally how / why should we discretize a continuous covariate? (Section 8.6 in book)
* Build flexible model with both SLC7A11 and BAP1?
* Transformations (such as log) for SLC7A11, since heavily right skewed.
See papers [TCGA Renal Cell Carcinoma Analysis](https://www.nature.com/articles/nature12222) and [SLC7A11 and BAP1 as Possible Causes of Poorer Survival Outcomes](https://www.nature.com/articles/s41556-018-0178-0) for more discussion of this data set. We will return to this in a few weeks.
## Cox Proportional Hazards Formulation and Partial Likelihood
Let the data be
$$
(T_j,\delta_j,Z_j)
$$
where $T_j$ is the observation time, $\delta_i$ is 0 for censored, and $Z_j$ is the covariate. The partial likelihood with no ties (book Equation 8.3.1) is
$$
L(\beta) = \prod_{i=1}^D \frac{exp(\sum_{k=1}^p \beta_k Z_{(i)k})}{\sum_{j \in R(t_i)} exp(\sum_{k=1}^p \beta_k Z_{jk})}
$$
Focusing on the case with 1-predictor, i.e. $p=1$, we have
$$
L(\beta) = \prod_{i=1}^D \frac{exp(\beta Z_{(i)})}{\sum_{j \in R(t_i)} exp(\beta Z_{j})}
$$
We seek to find the $\beta$ which maximizes this function which is equivalent to maximizing the log
$$
\beta_{MLE} = argmax_\beta LL(\beta) = \sum_{i=1}^D (\beta Z_{(i)} - \log(\sum_{j \in R(t_i)} exp(\beta Z_j)))
$$
## Maximizing the Partial Likelihood
We consider fitting the Cox proportional hazards model using only `SLC7A11_ex` as a predictor. We will only use unique times. We cover the partial likelihood for tied death times later.
```{r unique_times}
length(unique(kirc_small$time))
nrow(kirc_small)
```
```{r compute_log_likelihood}
## first consider case with no event time ties
kirc_small_unique <- kirc_small[!duplicated(kirc_small$time),]
head(kirc_small_unique)
## simplify variable names
ti <- kirc_small_unique$time
del <- kirc_small_unique$death_obs
z <- kirc_small_unique$SLC7A11_ex
## order data increasing with time
ords <- order(ti)
ti <- ti[ords]
del <- del[ords]
z <- z[ords]
## evaluate log likelihood at this beta
beta <- .1
## the log likelihood is
s <- log(rev(cumsum(rev(exp(beta*z)))))
LL_vec <- beta*z - s
sum(LL_vec[del==1])
```
We need to be able to evaluate the log likelihood at many $\beta$ (so we can find the maximum). So we `functionalize` the above code a bit.
```{r compute_log_likelihood_function}
## the log likelihood is
coxph_log_like <- function(beta,ti,del,z){
s <- log(rev(cumsum(rev(exp(beta*z)))))
LL_vec <- beta*z - s
return(sum(LL_vec[del==1]))
}
```
We now plot the log likelihood over a range of beta values
```{r plot_log_like}
## finding an appropriate grid for betas can be tricky
## could standardized predictor first (set to mean 0, sd=1)
betas <- seq(.0001,0.03,by=0.00001)
log_like <- vapply(betas,function(beta) coxph_log_like(beta,ti,del,z),c(0))
mle <- betas[which.max(log_like)]
plot(betas,log_like,ylim=c(-500,-440),log='x')
abline(v=mle)
print(paste("The MLE is:",mle))
```
Compare result to built in `coxph` function in survival package:
```{r fit_coxph}
fit <- coxph(Surv(time,death_obs)~SLC7A11_ex,data=kirc_small_unique)
print(fit)
```
In practice, the grid search which we did here is **very** computationally inefficient:
* If you need $p$ grid points with 1 parameter (p grid points), then $p^2$ with parameters, $p^3$ with 3 parameters, . . .
* Typically find roots of derivative of log likelihood (score) using Newton Raphson, which requires computing matrix of second derivatives (Hessian). See book appendix A for details.
* There are lots of possible challenges with optimization that we will not discuss in detail: local maxima, algorithm starting points, iteration step sizes, convergence criteria, etc.
## Intuitive Derivation of Partial Likelihood
This is covered in Theoretical Notes 1 (partial likelihood derivation) and 2 (profile likelihood derivation) on page 257 of textbook. We discussed Note 2 in class.
## Optional Exercises
* Generalize the function `coxph_log_like` to use a 2 dimensional $\beta$ vector and evaluate the likelihood on a grid, using both `BAP1` and `SLC7A11_ex` as predictors. Find the maximizer of the likelihood on the grid. Verify your solution by using the `coxph` function in the `survival` package.
* Rather than use a grid search, use one of the builtin optimization functions in `R` such as `optim` or `optimize`.
```{r makedotRfile, echo=FALSE, include=FALSE}
## automatrically generates .R file when running knitr
knitr::purl("01coxph.Rmd")
```