Using somatic variant richness to mine signals from rare variants in the cancer genome

To date, the vast majority of somatic variants observed in the cancer genome have been rare, and in practice a newly sequenced tumor commonly carries variants that have not been observed before. Here we focus on estimating the probability of encountering such hitherto unseen variants. We draw upon statistical methodology developed in other fields, notably species estimation in ecology and word frequency estimation in computational linguistics. Analysis of whole-exome and targeted panel sequencing data sets reveals substantial variability in variant “richness” between genes that could be harnessed for clinically relevant problems. We quantify the variant-tissue association and show a strong gene-specific, lineage-dependent pattern of encountering new variants. This variability is largely determined by the proportion of observed variants that are rare. Our findings suggest that variants that occur at very low frequencies can harbor important signals that are clinically consequential.

Sequencing cancer genomes reveals low-frequency novel somatic variants without known function. Here, the authors leverage statistical methodology from the fields of computational linguistics and ecology to highlight the potentially important signals harboured by these novel variants, which are often dismissed.
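As an illustration of the kind of estimate discussed above, the following is a minimal sketch of the classical Good-Turing estimator from the species and word-frequency literature: the probability that the next observation is a previously unseen type is approximated by the fraction of observations that are singletons. Whether this matches the authors' exact procedure is an assumption; the function name and the toy counts are hypothetical and are not data from the study.

```python
from collections import Counter


def good_turing_unseen_probability(observations):
    """Good-Turing estimate of the probability that the next observation
    is a type (here, a variant) not seen in the sample so far.

    observations: iterable of hashable labels, one per observed occurrence.
    Returns N1 / N, where N1 is the number of types seen exactly once
    and N is the total number of observations.
    """
    counts = Counter(observations)
    n_total = sum(counts.values())
    if n_total == 0:
        raise ValueError("at least one observation is required")
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_total


# Toy, made-up sample for one gene: two recurrent variants, three singletons.
toy_calls = (
    ["variant_A"] * 3
    + ["variant_B"] * 2
    + ["variant_C", "variant_D", "variant_E"]
)
p_unseen = good_turing_unseen_probability(toy_calls)
print(f"Estimated probability of a new variant: {p_unseen:.3f}")
```

Under this sketch, a gene whose observed variants are mostly singletons gets a high estimate and a gene dominated by a few recurrent hotspots gets a low one, which is consistent with the statement that the variability is driven by the proportion of observed variants that are rare.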
