New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx

The advent of Next-Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. The amount of genomic data is now massive, and substantial effort and new tools are required to unveil the information hidden in it. The Genomic Data Commons (GDC) Data Portal is a platform that hosts a range of genomic studies, including those from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30,000 patients. Such platforms, although very attractive, must ensure that the stored data are easily accessible and adequately harmonized. Moreover, their primary focus is storing the data in a single place, and they do not provide a comprehensive toolkit for analyzing and interpreting the data. To fill this gap, comprehensive yet easily accessible computational methods are required for integrative analyses of genomic data, without sacrificing a robust statistical and theoretical framework. The R/Bioconductor package TCGAbiolinks was developed in this context and offers a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks: i) more accurate and flexible pipelines for differential expression analyses, ii) different methods for tumor purity estimation and filtering, iii) integration of normal samples from other platforms, and iv) support for other genomics datasets, exemplified here by the TARGET data. Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, because variation in purity can confound differential expression analysis. With this in mind, we implemented purity estimation and filtering procedures in TCGAbiolinks. A further limitation of some TCGA datasets is the unavailability or paucity of corresponding normal samples. We therefore integrated into TCGAbiolinks the possibility of using normal samples from the Genotype-Tissue Expression (GTEx) project, another large-scale repository, which catalogs gene expression from healthy individuals. The new functionalities are available in TCGAbiolinks version 2.8 and higher, released in Bioconductor version 3.7.
