Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines.

The Cancer Genome Atlas (TCGA) cancer genomics dataset includes over 10,000 tumor-normal exome pairs across 33 different cancer types, in total >400 TB of raw data files requiring analysis. Here we describe the Multi-Center Mutation Calling in Multiple Cancers project, our effort to generate a comprehensive encyclopedia of somatic mutation calls for the TCGA data to enable robust cross-tumor-type analyses. Our approach accounts for variance and batch effects introduced by the rapid advancement of DNA extraction, hybridization-capture, sequencing, and analysis methods over time. We present best practices for applying an ensemble of seven mutation-calling algorithms with scoring and artifact filtering. The dataset created by this analysis includes 3.5 million somatic variants and forms the basis for PanCan Atlas papers. The results have been made available to the research community along with the methods used to generate them. This project is the result of collaboration from a number of institutes and demonstrates how team science drives extremely large genomics projects.

[1]  Steven A. Roberts,et al.  Mutational heterogeneity in cancer and the search for new cancer-associated genes , 2013 .

[2]  Terence P. Speed,et al.  Comparing somatic mutation-callers: beyond Venn diagrams , 2013, BMC Bioinformatics.

[3]  J C Costello,et al.  Seeking the Wisdom of Crowds Through Challenge‐Based Competitions in Biomedical Research , 2013, Clinical pharmacology and therapeutics.

[4]  C. Turnbull,et al.  Introducing whole-genome sequencing into routine cancer care: the Genomics England 100 000 Genomes Project. , 2018, Annals of oncology : official journal of the European Society for Medical Oncology.

[5]  P. A. Futreal,et al.  MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data , 2016, Genome Biology.

[6]  A. Sivachenko,et al.  Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples , 2013, Nature Biotechnology.

[7]  Benjamin J. Raphael,et al.  Mutational landscape and significance across 12 major cancer types , 2013, Nature.

[8]  Joshua M. Stuart,et al.  RADIA: RNA and DNA Integrated Analysis for Somatic Mutation Detection , 2014, PloS one.

[9]  Project GENIE Goes Public. , 2017, Cancer discovery.

[10]  James Y. Zou Analysis of protein-coding genetic variation in 60,706 humans , 2015, Nature.

[11]  Kai Ye,et al.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads , 2009, Bioinform..

[12]  Icgc,et al.  Pan-cancer analysis of whole genomes , 2017, bioRxiv.

[13]  Mingming Jia,et al.  COSMIC: exploring the world's knowledge of somatic mutations in human cancer , 2014, Nucleic Acids Res..

[14]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[15]  Georgina L Ryland,et al.  A simple consensus approach improves somatic mutation prediction accuracy , 2013, Genome Medicine.

[16]  Matthew B. Callaway,et al.  MuSiC: Identifying mutational significance in cancer genomes , 2012, Genome research.

[17]  Trevor J Pugh,et al.  Initial genome sequencing and analysis of multiple myeloma , 2011, Nature.

[18]  Joshua M. Stuart,et al.  Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection , 2015, Nature Methods.

[19]  F. Cunningham,et al.  The Ensembl Variant Effect Predictor , 2016, Genome Biology.

[20]  T. Graubert,et al.  Genomics in childhood acute myeloid leukemia comes of age , 2018, Nature Medicine.

[21]  Steven J. M. Jones,et al.  Comprehensive genomic characterization of squamous cell lung cancers , 2012, Nature.

[22]  Steven J. M. Jones,et al.  Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma. , 2017, Cancer cell.

[23]  Doron Lipson,et al.  High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis. , 2017, Cancer research.

[24]  Steven J. M. Jones,et al.  Comprehensive Characterization of Cancer Driver Genes and Mutations , 2018, Cell.

[25]  L. Stein The case for cloud computing in genome informatics , 2010, Genome Biology.

[26]  Sarah C. Ayling,et al.  The Ensembl gene annotation system , 2016, Database J. Biol. Databases Curation.

[27]  Michael T. Zimmermann,et al.  Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas , 2018, Cell reports.

[28]  Steven A. Roberts,et al.  Mutational heterogeneity in cancer and the search for new cancer genes , 2014 .

[29]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[30]  Steven J. M. Jones,et al.  Genomic Classification of Cutaneous Melanoma , 2015, Cell.

[31]  Steven J. M. Jones,et al.  Comprehensive molecular profiling of lung adenocarcinoma , 2014, Nature.

[32]  Adam A. Margolin,et al.  The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity , 2012, Nature.

[33]  Alexander Lex,et al.  UpSetR: an R package for the visualization of intersecting sets and their properties , 2017, bioRxiv.

[34]  Ken Chen,et al.  SomaticSniper: identification of somatic point mutations in whole genome sequencing data , 2012, Bioinform..

[35]  Christopher A. Miller,et al.  VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. , 2012, Genome research.

[36]  Steven J. M. Jones,et al.  The Immune Landscape of Cancer , 2018, Immunity.

[37]  Trevor J Pugh,et al.  Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation , 2013, Nucleic acids research.