Uniform genomic data analysis in the NCI Genomic Data Commons

The goal of the National Cancer Institute (NCI) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The initial GDC dataset includes genomic, epigenomic, proteomic, clinical, and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June 2015 on an OpenStack-based private cloud. By June 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build, GRCh38, the GDC generated a variety of data types, ranging from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).
