library(survival)
We will work with a small subset of The Cancer Genome Atlas Renal Cell Carcinoma data set.
load("kirc_small.RData")
ls()
## [1] "kirc_small"
head(kirc_small)
## time death_obs BAP1_mutation SLC7A11_ex
## TCGA-3Z-A93Z-01A-11D-A36X-10 385 0 0 33.2466
## TCGA-6D-AA2E-01A-11D-A36X-10 362 0 0 16.7296
## TCGA-A3-3308-01A-01D-0966-08 16 0 1 132.5840
## TCGA-A3-3311-01A-01D-0966-08 1191 1 0 52.6930
## TCGA-A3-3313-01A-01D-0966-08 735 1 0 565.8552
## TCGA-A3-3316-01A-01D-0966-08 1493 0 0 93.8776
This is right censored data. time
is time of death or censoring. death_obs
is 1 if death observed, 0 if censored. BAP1_mutation
indicates presence of mutation in BAP1 gene. SLC7A11_ex
is expression of gene SLC7A11.
nrow(kirc_small)
## [1] 378
table(kirc_small$death_obs)
##
## 0 1
## 279 99
table(kirc_small$BAP1_mutation)
##
## 0 1
## 335 43
par(mfcol=c(1,2))
hist(kirc_small$time)
hist(log2(kirc_small$SLC7A11_ex))
Associations and possible causal relations have been discovered for the effects of BAP1 mutations and SLC7A11 on survival. We plot Kaplan Meier for BAP1 and SLC7A11 thresholded on median value.
kirc_small[,"SLC7A11_gr_median"] <- kirc_small$SLC7A11_ex > median(kirc_small$SLC7A11_ex)
par(mfcol=c(1,2),mar=c(5,5,1,1))
fit1 <- survfit(Surv(time,death_obs)~BAP1_mutation,data=kirc_small)
plot(fit1,xlab="Days Since Diagnosis",ylab="S(t)",col=1:2,lty=1:2,lwd=2)
legend("bottomleft",names(fit1$strata),col=1:2,lty=1:2,lwd=2)
fit2 <- survfit(Surv(time,death_obs)~SLC7A11_gr_median,data=kirc_small)
plot(fit2,xlab="Days Since Diagnosis",ylab="S(t)",col=1:2,lty=1:2,lwd=2)
legend("bottomleft",names(fit2$strata),col=1:2,lty=1:2,lwd=2)
survdiff(Surv(time,death_obs)~BAP1_mutation,data=kirc_small)
## Call:
## survdiff(formula = Surv(time, death_obs) ~ BAP1_mutation, data = kirc_small)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## BAP1_mutation=0 335 79 87.3 0.795 6.76
## BAP1_mutation=1 43 20 11.7 5.951 6.76
##
## Chisq= 6.8 on 1 degrees of freedom, p= 0.009
survdiff(Surv(time,death_obs)~SLC7A11_gr_median,data=kirc_small)
## Call:
## survdiff(formula = Surv(time, death_obs) ~ SLC7A11_gr_median,
## data = kirc_small)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## SLC7A11_gr_median=FALSE 189 40 49.4 1.77 3.55
## SLC7A11_gr_median=TRUE 189 59 49.6 1.76 3.55
##
## Chisq= 3.5 on 1 degrees of freedom, p= 0.06
Comments / Questions :
See papers TCGA Renal Cell Carcinoma Analysis and SLC7A11 and BAP1 as Possible Causes of Poorer Survival Outcomes for more discussion of this data set. We will return to this in a few weeks.
Let the data be \[ (T_j,\delta_j,Z_j) \] where \(T_j\) is the observation time, \(\delta_i\) is 0 for censored, and \(Z_j\) is the covariate. The partial likelihood with no ties (book Equation 8.3.1) is
\[ L(\beta) = \prod_{i=1}^D \frac{exp(\sum_{k=1}^p \beta_k Z_{(i)k})}{\sum_{j \in R(t_i)} exp(\sum_{k=1}^p \beta_k Z_{jk})} \]
Focusing on the case with 1-predictor, i.e. \(p=1\), we have
\[ L(\beta) = \prod_{i=1}^D \frac{exp(\beta Z_{(i)})}{\sum_{j \in R(t_i)} exp(\beta Z_{j})} \] We seek to find the \(\beta\) which maximizes this function which is equivalent to maximizing the log
\[ \beta_{MLE} = argmax_\beta LL(\beta) = \sum_{i=1}^D (\beta Z_{(i)} - \log(\sum_{j \in R(t_i)} exp(\beta Z_j))) \]
We consider fitting the Cox proportional hazards model using only SLC7A11_ex
as a predictor. We will only use unique times. We cover the partial likelihood for tied death times later.
length(unique(kirc_small$time))
## [1] 347
nrow(kirc_small)
## [1] 378
## first consider case with no event time ties
kirc_small_unique <- kirc_small[!duplicated(kirc_small$time),]
head(kirc_small_unique)
## time death_obs BAP1_mutation SLC7A11_ex
## TCGA-3Z-A93Z-01A-11D-A36X-10 385 0 0 33.2466
## TCGA-6D-AA2E-01A-11D-A36X-10 362 0 0 16.7296
## TCGA-A3-3308-01A-01D-0966-08 16 0 1 132.5840
## TCGA-A3-3311-01A-01D-0966-08 1191 1 0 52.6930
## TCGA-A3-3313-01A-01D-0966-08 735 1 0 565.8552
## TCGA-A3-3316-01A-01D-0966-08 1493 0 0 93.8776
## SLC7A11_gr_median
## TCGA-3Z-A93Z-01A-11D-A36X-10 FALSE
## TCGA-6D-AA2E-01A-11D-A36X-10 FALSE
## TCGA-A3-3308-01A-01D-0966-08 TRUE
## TCGA-A3-3311-01A-01D-0966-08 TRUE
## TCGA-A3-3313-01A-01D-0966-08 TRUE
## TCGA-A3-3316-01A-01D-0966-08 TRUE
## simplify variable names
ti <- kirc_small_unique$time
del <- kirc_small_unique$death_obs
z <- kirc_small_unique$SLC7A11_ex
## order data increasing with time
ords <- order(ti)
ti <- ti[ords]
del <- del[ords]
z <- z[ords]
## evaluate log likelihood at this beta
beta <- .1
## the log likelihood is
s <- log(rev(cumsum(rev(exp(beta*z)))))
LL_vec <- beta*z - s
sum(LL_vec[del==1])
## [1] -5087.516
We need to be able to evaluate the log likelihood at many \(\beta\) (so we can find the maximum). So we functionalize
the above code a bit.
## the log likelihood is
coxph_log_like <- function(beta,ti,del,z){
s <- log(rev(cumsum(rev(exp(beta*z)))))
LL_vec <- beta*z - s
return(sum(LL_vec[del==1]))
}
We now plot the log likelihood over a range of beta values
## finding an appropriate grid for betas can be tricky
## could standardized predictor first (set to mean 0, sd=1)
betas <- seq(.0001,0.03,by=0.00001)
log_like <- vapply(betas,function(beta) coxph_log_like(beta,ti,del,z),c(0))
mle <- betas[which.max(log_like)]
plot(betas,log_like,ylim=c(-500,-440),log='x')
abline(v=mle)
print(paste("The MLE is:",mle))
## [1] "The MLE is: 0.00266"
Compare result to built in coxph
function in survival package:
fit <- coxph(Surv(time,death_obs)~SLC7A11_ex,data=kirc_small_unique)
print(fit)
## Call:
## coxph(formula = Surv(time, death_obs) ~ SLC7A11_ex, data = kirc_small_unique)
##
## coef exp(coef) se(coef) z p
## SLC7A11_ex 0.0026560 1.0026596 0.0006478 4.1 4.13e-05
##
## Likelihood ratio test=11.52 on 1 df, p=0.0006881
## n= 347, number of events= 89
In practice, the grid search which we did here is very computationally inefficient:
This is covered in Theoretical Notes 1 (partial likelihood derivation) and 2 (profile likelihood derivation) on page 257 of textbook. We discussed Note 2 in class.
coxph_log_like
to use a 2 dimensional \(\beta\) vector and evaluate the likelihood on a grid, using both BAP1
and SLC7A11_ex
as predictors. Find the maximizer of the likelihood on the grid. Verify your solution by using the coxph
function in the survival
package.R
such as optim
or optimize
.