3/26/2019

The Cancer Genome Atlas (TCGA)

What is TCGA?

Comprehensive Cancer Genomic Data Set

  • Project began in 2005 with mission to "obtain a comprehensive understanding of the genomic alterations that underlie all major cancers"
  • 68 cancer primary sites
  • 33,549 patients
  • Genomic, transcriptomic, proteomic, methylation, and imaging data for most patients
  • Most data is freely available:

TCGA Science

TCGA Data

  • Clinical Data: Dozens of variables per patient
    • Cancer type, age at diagnosis, cancer stage, survival time
  • Genomic Data: Tens of thousands of variables per case
    • We focus on Somatic SNVs and Gene Expression
  • The combination of Clinical and Genomic data for a large number of cancer types makes TCGA a unique resource

Single Nucleotide Variant (SNV)

  • DNA contains approximately 3 billion base pairs (T, C, G, or A) in the chromosomes of each cell
  • Base pairs are mostly the same in different cells
  • Cancer cells have altered base pairs, known as somatic Single Nucleotide Variants (SNVs)
  • SNVs may drive the growth and division of cells

Image Source: https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism

Gene Expression Data

  • "A gene is a sequence of nucleotides in DNA or RNA that codes for a molecule that has a function" (Wikipedia)
  • ~20,000 protein-coding genes in the human genome
  • The expression level of a gene is, roughly, the number of RNA sequences in a cell corresponding to a specific gene

Image Source: https://www.ncbi.nlm.nih.gov/probe/docs/applexpression/

Assembling TCGA Data in R

  • R code for downloading data: TCGA Assembler
  • For Kidney Renal Clear Cell Carcinoma (KIRC), the course website hosts:
    1. Code for using TCGA Assembler to download / clean data
    2. Cleaned data set
    3. Code to do some exploratory data analysis on cleaned data
  • Note: For homework 5, you will only need to use the cleaned data set
  • Survival Time is Days from Diagnosis to Death
    • Right censored for living patients
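A minimal sketch of how right censoring is usually encoded as a (time, event) pair. The argument names `days_to_death` and `days_to_last_followup` are hypothetical and are not taken from the cleaned course data set.

```python
def survival_outcome(days_to_death, days_to_last_followup):
    """Return (time, event) for one patient.

    event = 1: death observed, time is days from diagnosis to death.
    event = 0: patient still alive at last contact, so the true survival
               time is only known to exceed `time` (right censored).
    Argument names are hypothetical placeholders.
    """
    if days_to_death is not None:
        return days_to_death, 1
    return days_to_last_followup, 0


# A deceased patient and a living (censored) patient
deceased = survival_outcome(days_to_death=812, days_to_last_followup=None)
living = survival_outcome(days_to_death=None, days_to_last_followup=1500)
```

For the living patient, the recorded time is a lower bound on survival time, not the survival time itself.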

Genomic and Survival Time Associations

Predicting Survival Time with Genomic Data

  • Our goal for this data set: Given 20,000+ mRNA expression levels and \(\approx 30\) SNV indicators, predict survival time
  • Possible Uses of Predictive Model:
    • Inform patients about likely survival time
    • Guide treatment: If survival time likely very long, can pursue less aggressive treatment
    • Understand the molecular drivers of cancer

High Dimensional Inference

Definition of Data Dimension

  • Dimension of data is number of variables measured per observation
    • usually denoted by p
    • e.g. \(X \in \mathbb{R}^{n \times p}\) is data with \(n\) observations and \(p\) variables
  • Classical setting: \(n\) is much larger than \(p\)
    • Example: measure treatment (0/1), race (4 categories), blood pressure, and survival for 450 patients
      • \(n=450\), \(p=5\) in Cox PH model with no interactions
  • High dimensional data: \(p \approx n\) or \(p > n\)
    • Example: TCGA \(p=20,000+\) gene expressions for each patient but only \(n\approx 400\) patients
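The count \(p=5\) in the classical example can be checked with a short sketch. It assumes the usual dummy coding, in which a categorical variable with \(k\) levels contributes \(k-1\) coefficients, and it treats survival as the response rather than a covariate. The dictionary layout is purely illustrative.

```python
# Number of categories per covariate, or None for a numeric covariate.
# Survival is the response, so it is not listed here.
covariates = {"treatment": 2, "race": 4, "blood_pressure": None}


def count_columns(covariates):
    """Count regression coefficients under dummy (reference-level) coding."""
    p = 0
    for levels in covariates.values():
        # numeric: 1 column; categorical with k levels: k - 1 columns
        p += 1 if levels is None else levels - 1
    return p


p = count_columns(covariates)  # 1 (treatment) + 3 (race) + 1 (blood pressure)
```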

What Happens in High Dimensions?

  • Estimates become unstable for many models when \(p \approx n\)
  • When \(p > n\) estimates may no longer be defined
  • Model results are difficult to interpret
  • Computation time grows

Cox Partial Log Likelihood in High Dimensions

Recall the Cox PH partial log likelihood with no ties:

\[ LL(\beta) = \sum_{i=1}^D \left( \beta^T Z_{(i)} - \log \sum_{j \in R(t_i)} \exp(\beta^T Z_j) \right) \] where \(t_1,\ldots,t_D\) denote the distinct death times, \(Z_{(i)}\) are the covariates for the individual who died at time \(t_i\), and \(R(t_i)\) is the set of individuals at risk at time \(t_i\).
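The formula above can be evaluated directly. The following is a minimal sketch in plain Python, assuming no tied death times and taking the risk set \(R(t_i)\) to be everyone whose observed time is at least \(t_i\). The function name and argument layout are illustrative and do not come from the course code.

```python
import math


def cox_partial_loglik(beta, Z, time, event):
    """Cox partial log likelihood, assuming no tied death times.

    beta  : list of p coefficients
    Z     : list of n covariate rows, each of length p
    time  : observed times (death or censoring)
    event : 1 if death observed, 0 if right censored
    """
    # Linear predictor beta^T Z_j for every individual
    eta = [sum(b * z for b, z in zip(beta, row)) for row in Z]
    ll = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored individuals only appear in risk sets
        # Risk set: individuals still under observation at time[i]
        risk = [eta[j] for j in range(len(time)) if time[j] >= time[i]]
        m = max(risk)  # log-sum-exp trick for numerical stability
        log_sum = m + math.log(sum(math.exp(r - m) for r in risk))
        ll += eta[i] - log_sum
    return ll


# With beta = 0, each death contributes -log(size of its risk set)
ll_null = cox_partial_loglik([0.0], [[1.0], [0.0], [0.5]],
                             time=[1, 2, 3], event=[1, 0, 1])
```

In the usage line, the risk sets have sizes 3 and 1, so `ll_null` equals \(-\log 3\).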

Question: Suppose we attempt to maximize the log likelihood for a data set where \(p > n\). What will happen?

\[ \beta_{MLE} = \arg\max_\beta LL(\beta) \]
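One way to explore this question numerically is with a toy data set where \(p > n\). The sketch below uses an artificial design (\(n=4\), \(p=6\), chosen only for illustration) in which some direction in \(\beta\) orders the risk scores exactly by death time. Scaling \(\beta\) along that direction keeps increasing the partial log likelihood toward its upper bound of 0, so in this example no finite maximizer exists.

```python
import math


def cox_partial_loglik(beta, Z, time, event):
    """Cox partial log likelihood, assuming no tied death times."""
    eta = [sum(b * z for b, z in zip(beta, row)) for row in Z]
    ll = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue
        risk = [eta[j] for j in range(len(time)) if time[j] >= time[i]]
        m = max(risk)
        ll += eta[i] - (m + math.log(sum(math.exp(r - m) for r in risk)))
    return ll


n, p = 4, 6  # more variables than observations (toy sizes)
# Artificial design: column i is an indicator for patient i, extra columns are 0
Z = [[1.0 if j == i else 0.0 for j in range(p)] for i in range(n)]
time = [1.0, 2.0, 3.0, 4.0]
event = [1, 1, 1, 1]

# Direction giving earlier deaths strictly larger risk scores
direction = [3.0, 2.0, 1.0, 0.0, 0.0, 0.0]

# Partial log likelihood along c * direction for increasing c
lls = [cox_partial_loglik([c * d for d in direction], Z, time, event)
       for c in (1, 5, 25)]
```

The values in `lls` increase with `c` and approach 0 without reaching it. With real expression data and \(p > n\), a direction like this generally exists whenever the covariate rows are linearly independent.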

Feature Screening and Multiple Testing

Distribution of \(\sim 20,000\) p-values for univariate Cox PH fits:

What are the "significantly" associated gene expressions?
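To see why a fixed cutoff is a problem here: if none of 20,000 genes were associated with survival, a 0.05 threshold would still flag about \(20,000 \times 0.05 = 1,000\) of them by chance. The sketch below implements the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. Its use here is an assumption for illustration, not necessarily the procedure used in the course. The p-values are made up.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up procedure at FDR level q.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose sorted p-value is below its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Made-up p-values for ten hypothetical genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
naive = sum(pv < 0.05 for pv in pvals)           # unadjusted 0.05 cutoff
bh = sum(benjamini_hochberg(pvals, q=0.05))      # after FDR control
```

In this toy example the unadjusted cutoff flags five genes, while the FDR-controlled procedure keeps only the two smallest p-values.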

Summary and Preview

  • Directly applying tools from the classical, low-dimensional setting to the high dimensional setting will not work.
    • Code will return errors
    • Coefficient estimates highly unstable
    • Unclear inferences
  • High dimensional problems have been an important topic in statistics for the last 25 years
  • Upcoming high dimensional methodology for TCGA
    • Penalized Cox proportional hazards regression
    • Variable Screening using Cox Models and False Discovery Rate Control
    • Prediction based assessment of models