The Cancer Genome Atlas and High Dimensional Inference

3/26/2019

The Cancer Genome Atlas (TCGA)

What is TCGA?

Comprehensive Cancer Genomic Data Set

Project began in 2005 with the mission to "obtain a comprehensive understanding of the genomic alterations that underlie all major cancers"
68 cancer primary sites
33,549 patients
Genomic, transcriptomic, proteomic, methylation, images for most patients
Most data is freely available

TCGA Science

Science / Medical Papers
Statistical / Bioinformatics Papers with TCGA Applications

TCGA Data

Clinical Data: Dozens of variables per patient
- Cancer type, age at diagnosis, cancer stage, survival time
Genomic Data: Tens of thousands of variables per case
- We focus on Somatic SNVs and Gene Expression
The combination of clinical and genomic data for a large number of cancer types makes TCGA a unique resource

Single Nucleotide Variant (SNV)

DNA contains approximately 3 billion base pairs (T, C, G, or A) in the chromosomes of each cell
Base pairs are mostly the same in different cells
Cancer cells have altered base pairs, known as somatic Single Nucleotide Variants (SNVs)
SNVs may drive the growth and division of cells

Image Source: https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism

Gene Expression Data

"A gene is a sequence of nucleotides in DNA or RNA that codes for a molecule that has a function" Wikipedia
~20,000 protein-coding genes in the human genome
The expression level of a gene is, roughly, the number of RNA sequences in a cell corresponding to a specific gene

Image Source: https://www.ncbi.nlm.nih.gov/probe/docs/applexpression/

Assembling TCGA Data in R

R code for downloading data: TCGA Assembler
For Kidney Renal Clear Cell Carcinoma (KIRC), the course website hosts:
1. Code for using TCGA Assembler to download / clean data
2. Cleaned data set
3. Code to do some exploratory data analysis on cleaned data
Note: For homework 5, you will only need to use the cleaned data set
Survival Time is Days from Diagnosis to Death
- Right censored for living patients
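A minimal sketch, in Python, of how a right-censored survival time is commonly encoded as a (time, event) pair. The records are made up, and the field names `days_to_death` and `days_to_last_followup` are illustrative assumptions, not necessarily the columns of the cleaned course data set.

```python
# Hypothetical patient records (made-up values, illustrative field names).
patients = [
    {"id": "A", "days_to_death": 420, "days_to_last_followup": None},
    {"id": "B", "days_to_death": None, "days_to_last_followup": 1300},
    {"id": "C", "days_to_death": 95, "days_to_last_followup": None},
]


def to_survival_pair(record):
    """Return (time, event): event is 1 for an observed death, 0 if right censored."""
    if record["days_to_death"] is not None:
        return record["days_to_death"], 1
    # Patient alive at last contact: the true survival time exceeds this value.
    return record["days_to_last_followup"], 0


survival = {p["id"]: to_survival_pair(p) for p in patients}
print(survival)
```

Patient B is still living, so only a lower bound on the survival time (1300 days) is known; the event indicator 0 records that.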

Genomic and Survival Time Associations

Patients were grouped into 4 clusters based on mRNA expression levels
Clusters correlated with survival times

Source: https://www.nature.com/articles/nature12222

Predicting Survival Time with Genomic Data

Our goal for this data set: Given 20,000+ mRNA expression levels and \(\approx 30\) SNV indicators, predict survival time
Possible Uses of Predictive Model:
- Inform patients about likely survival time
- Guide treatment: If survival time likely very long, can pursue less aggressive treatment
- Understand the molecular drivers of cancer

High Dimensional Inference

Definition of Data Dimension

Dimension of data is number of variables measured per observation
- usually denoted by p
- e.g. \(X \in \mathbb{R}^{n \times p}\) is data with \(n\) observations and \(p\) number of variables
Classical setting: \(n\) is much larger than \(p\)
- Example: measure treatment (0/1), race (4 categories), blood pressure, and survival for 450 patients
  - \(n=450\), \(p=5\) in Cox PH model with no interactions
High dimensional data: \(p \approx n\) or \(p > n\)
- Example: TCGA \(p=20,000+\) gene expressions for each patient but only \(n\approx 400\) patients
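A small sketch of how \(p\) is counted in the classical example, assuming the usual dummy coding in which a categorical variable with \(k\) levels contributes \(k-1\) columns. Survival is the response, so it does not count toward \(p\). The helper name `n_columns` is hypothetical.

```python
# Covariates from the classical example: number of levels, or None if continuous.
covariates = {
    "treatment": 2,          # binary 0/1 -> 1 column
    "race": 4,               # 4 categories -> 3 dummy columns (one reference level)
    "blood_pressure": None,  # continuous -> 1 column
}


def n_columns(levels):
    """Design-matrix columns: 1 if continuous, levels - 1 if categorical."""
    return 1 if levels is None else levels - 1


p_classical = sum(n_columns(levels) for levels in covariates.values())
n_classical = 450

# Observations available per coefficient in each setting.
ratio_classical = n_classical / p_classical
ratio_tcga = 400 / 20000  # approximate TCGA figures from the slide

print(p_classical, ratio_classical, ratio_tcga)
```

In the classical example there are many observations per coefficient; in the TCGA example there are far fewer observations than coefficients.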

What Happens in High Dimensions?

Estimates become unstable for many models when \(p \approx n\)
When \(p > n\) estimates may no longer be defined
Model results are difficult to interpret
Computation time grows
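
A short sketch of why estimates can stop being defined when \(p > n\), for any model that depends on the coefficients only through the linear predictor \(X\beta\), with \(X \in \mathbb{R}^{n \times p}\) as defined earlier:

\[ p > n \;\Rightarrow\; \operatorname{rank}(X) \le n < p \;\Rightarrow\; \exists\, v \ne 0 \text{ with } Xv = 0 \;\Rightarrow\; X(\beta + c\,v) = X\beta \text{ for all } c \in \mathbb{R} \]

So \(\beta\) and \(\beta + c\,v\) produce identical fitted values, and the data alone cannot distinguish between them.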

Cox Partial Log Likelihood in High Dimensions

Recall the Cox PH partial log likelihood with no ties:

\[ LL(\beta) = \sum_{i=1}^D \left( \beta^T Z_{(i)} - \log\left( \sum_{j \in R(t_i)} \exp(\beta^T Z_j) \right) \right) \] where \(t_1,\ldots,t_D\) denote the distinct death times, \(Z_{(i)}\) are the covariates for the individual who died at time \(t_i\), and \(R(t_i)\) is the set of individuals at risk at time \(t_i\).

Question: Suppose we attempt to maximize the log likelihood for a data set where \(p > n\). What will happen?

\[ \beta_{MLE} = \arg\max_\beta LL(\beta) \]
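
A minimal Python sketch of one thing that can happen, using made-up toy data with \(n = 3\) and \(p = 4\). The function name `cox_partial_loglik` and the chosen `direction` are illustrative assumptions. The code evaluates \(LL(\beta)\) from the formula above (no ties) along \(\beta = c \cdot \text{direction}\), where the direction ranks individuals in the order they died.

```python
import math


def cox_partial_loglik(beta, times, events, Z):
    """Cox partial log likelihood, assuming no tied death times.

    times[i]  : observed time for individual i
    events[i] : 1 if death observed, 0 if right censored
    Z[i]      : list of p covariate values for individual i
    """
    eta = [sum(b * z for b, z in zip(beta, row)) for row in Z]  # beta^T Z_i
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:
            continue  # censored individuals contribute only through risk sets
        # Risk set R(t_i): everyone still under observation at time t_i.
        risk = [eta[j] for j, t_j in enumerate(times) if t_j >= t_i]
        ll += eta[i] - math.log(sum(math.exp(e) for e in risk))
    return ll


# Toy data (made up): n = 3 individuals, p = 4 covariates, so p > n.
times = [2.0, 5.0, 9.0]
events = [1, 1, 0]
Z = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

# Direction that gives the earliest death the largest linear predictor.
direction = [2.0, 1.0, 0.0, 0.0]
scales = [0.0, 1.0, 5.0, 20.0]
values = [
    cox_partial_loglik([c * d for d in direction], times, events, Z)
    for c in scales
]
for c, v in zip(scales, values):
    print(c, v)
```

On this toy data the partial log likelihood keeps increasing toward 0 as \(c\) grows, so no finite \(\beta\) attains the maximum. With \(p > n\) such a perfectly ordering direction is easy to find, which is one reason the maximizer may fail to exist.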

Feature Screening and Multiple Testing

Distribution of \(\sim 20,000\) p-values for univariate Cox PH fits:

What are the "significantly" associated gene expressions?
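
With roughly 20,000 tests, a 0.05 cutoff on unadjusted p-values would flag about 1,000 genes even if none were truly associated. Below is a minimal Python sketch of two standard corrections, Bonferroni and the Benjamini-Hochberg step-up procedure, applied to made-up p-values. The function names are hypothetical, and this is not necessarily the procedure used later in the course.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure targeting FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Made-up p-values standing in for univariate Cox PH fits.
pvalues = [0.0001, 0.004, 0.015, 0.02, 0.04, 0.2, 0.5, 0.9]

n_unadjusted = sum(p <= 0.05 for p in pvalues)
n_bonferroni = sum(bonferroni_reject(pvalues))
n_bh = sum(benjamini_hochberg_reject(pvalues))
print(n_unadjusted, n_bonferroni, n_bh)
```

On these toy values the unadjusted cutoff flags the most tests, Bonferroni the fewest, and Benjamini-Hochberg falls in between.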

Summary and Preview

Directly applying tools from the classical, low-dimensional setting to the high dimensional setting will not work.
- Code will return errors
- Coefficient estimates highly unstable
- Unclear inferences
High dimensional problems have been an important topic in statistics for the last 25 years
Upcoming high dimensional methodology for TCGA
- Penalized Cox proportional hazards regression
- Variable Screening using Cox Models and False Discovery Rate Control
- Prediction based assessment of models