Exploring Drivers of Gene Expression in The Cancer Genome Atlas

Motivation: The Cancer Genome Atlas (TCGA) has greatly advanced cancer research by generating, curating and publicly releasing deeply measured molecular data from thousands of tumor samples. In particular, gene expression measures, both within and across cancer types, have been used to determine the genes and proteins that are active in tumor cells.

Results: To investigate the behavior of gene expression in TCGA tumor samples more thoroughly, we introduce a statistical framework for partitioning the variation in gene expression among a variety of molecular variables, including somatic mutations, transcription factors (TFs), microRNAs, copy number alterations, methylation and germ-line genetic variation. As proof of principle, we identify and validate specific TFs that influence the expression of PTPN14 in breast cancer cells.

Availability and implementation: We provide a freely available, interactive web application for exploring the results of our transcriptome-wide analyses across 17 different cancers in TCGA at http://ls-shiny-prod.uwm.edu/edge_in_tcga. All TCGA Open Access tier data are available at the Broad Institute GDAC Firehose and were downloaded using the TCGA2STAT R package. TCGA Controlled Access tier data are available through the Genomic Data Commons (GDC). R scripts used to download, format and analyze the data and to produce the interactive R/Shiny web app are available on GitHub at https://github.com/andreamrau/EDGE-in-TCGA.
