The Cancer Genome Atlas (TCGA)

What is TCGA?

Comprehensive Cancer Genomic Data Set

  • Project began in 2005 with the mission to "obtain a comprehensive understanding of the genomic alterations that underlie all major cancers"
  • 68 cancer primary sites
  • 33,549 patients
  • Genomic, transcriptomic, proteomic, methylation, and imaging data for most patients
  • Most data is freely available:

TCGA Science


  • Clinical Data: Dozens of variables per patient
    • Cancer type, age at diagnosis, cancer stage, survival time
  • Genomic Data: Tens of thousands of variables per case
    • We focus on somatic SNVs and gene expression
  • The combination of clinical and genomic data for a large number of cancer types makes TCGA a unique resource
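To illustrate what combining clinical and genomic data means in practice, here is a minimal sketch that joins the two kinds of records by case ID. The case IDs, field names, and values are invented for illustration and do not come from TCGA; real TCGA files have their own identifiers and formats.

```python
# Toy records: case IDs, field names, and values are hypothetical.
clinical = {
    "CASE-001": {"cancer_type": "BRCA", "age_at_diagnosis": 54, "stage": "II"},
    "CASE-002": {"cancer_type": "LUAD", "age_at_diagnosis": 67, "stage": "III"},
    "CASE-003": {"cancer_type": "BRCA", "age_at_diagnosis": 61, "stage": "I"},
}

# Expression values per gene; CASE-003 deliberately has no genomic record.
expression = {
    "CASE-001": {"TP53": 12.1, "EGFR": 3.4},
    "CASE-002": {"TP53": 8.7, "EGFR": 15.2},
}


def join_by_case(clinical_table, genomic_table):
    """Return one combined record for each case ID present in both tables."""
    shared = sorted(clinical_table.keys() & genomic_table.keys())
    return {case: {**clinical_table[case], **genomic_table[case]} for case in shared}


combined = join_by_case(clinical, expression)
for case, record in combined.items():
    print(case, record)
```

Only cases with both kinds of data survive the join, which is why analyses that relate genomic features to outcomes such as survival time are usually run on this intersection.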

Single Nucleotide Variant (SNV)

  • Human DNA contains approximately 3 billion base pairs (each base is T, C, G, or A) in the chromosomes of each cell
  • The base-pair sequence is mostly the same across different cells
  • Cancer cells have altered base pairs, known as somatic Single Nucleotide Variants (SNVs)
  • SNVs may drive the growth and division of cells

Image Source: https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
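As a toy illustration of the SNV idea, the sketch below compares a short normal sequence with a tumor sequence and reports the positions where a single base differs. The sequences and the function name find_snvs are hypothetical. Real somatic variant calling works on aligned sequencing reads and uses statistical models, so this is only a conceptual picture.

```python
def find_snvs(normal, tumor):
    """Return (1-based position, normal base, tumor base) for each mismatch."""
    if len(normal) != len(tumor):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, n, t)
        for i, (n, t) in enumerate(zip(normal, tumor))
        if n != t
    ]


# Hypothetical 8-base sequences; the tumor copy differs at one position.
normal_seq = "ACGTTGCA"
tumor_seq = "ACGATGCA"

snvs = find_snvs(normal_seq, tumor_seq)
print(snvs)  # [(4, 'T', 'A')]
```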

Gene Expression Data

  • "A gene is a sequence of nucleotides in DNA or RNA that codes for a molecule that has a function" (Wikipedia)
  • ~20,000 protein-coding genes in the human genome
  • The expression level of a gene is, roughly, the number of RNA transcripts in a cell that correspond to that gene
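The rough definition above can be sketched as a counting exercise: given a list saying which gene each sequenced RNA read was assigned to, count the reads per gene. The read assignments below are invented, and the counts-per-million scaling is one common normalization shown here as an assumption, not necessarily the one used for the TCGA data.

```python
from collections import Counter

# Hypothetical assignment of six RNA reads to genes.
read_assignments = ["TP53", "EGFR", "TP53", "BRCA1", "TP53", "EGFR"]

# Raw expression: number of reads per gene.
counts = Counter(read_assignments)

# Scale to counts per million so samples with different read totals are comparable.
total_reads = sum(counts.values())
cpm = {gene: n / total_reads * 1_000_000 for gene, n in counts.items()}

for gene in sorted(counts):
    print(gene, counts[gene], round(cpm[gene]))
```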