Evaluating the evaluation of cancer driver genes

Significance Modern large-scale sequencing of human cancers seeks to comprehensively discover mutated genes that confer a selective advantage to cancer cells. Key to this effort has been development of computational algorithms to find genes that drive cancer based on their patterns of mutation in large patient cohorts. Because there is no generally accepted gold standard of driver genes, it has been difficult to quantitatively compare these methods. We present a machine-learning–based method for driver gene prediction and a protocol to evaluate and compare prediction methods. Our results suggest that most current methods do not adequately account for heterogeneity in the number of mutations expected by chance and consequently yield many false-positive calls, particularly in cancers with high mutation rate.

Sequencing has identified millions of somatic mutations in human cancers, but distinguishing cancer driver genes remains a major challenge. Numerous methods have been developed to identify driver genes, but evaluation of the performance of these methods is hindered by the lack of a gold standard, that is, bona fide driver gene mutations. Here, we establish an evaluation framework that can be applied to driver gene prediction methods. We used this framework to compare the performance of eight such methods. One of these methods, described here, incorporated a machine-learning–based ratiometric approach. We show that the driver genes predicted by each of the eight methods vary widely. Moreover, the P values reported by several of the methods were inconsistent with the uniform values expected, thus calling into question the assumptions that were used to generate them. Finally, we evaluated the potential effects of unexplained variability in mutation rates on false-positive driver gene predictions. Our analysis points to the strengths and weaknesses of each of the currently available methods and offers guidance for improving them in the future.
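The point about P values can be illustrated with a small check. If a method's null model is well calibrated, the P values it reports for genes that are not drivers should be approximately uniform on [0, 1]. The sketch below is not the evaluation protocol of the paper, whose details are not given here; it is a minimal illustration under stated assumptions. The function names `ks_distance_from_uniform` and `fraction_below`, the simulated P values, and the "inflated" method are all hypothetical.

```python
import random


def ks_distance_from_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF of the
    given P values and the Uniform(0, 1) CDF (larger = less uniform)."""
    ps = sorted(p_values)
    n = len(ps)
    if n == 0:
        raise ValueError("need at least one P value")
    distance = 0.0
    for i, p in enumerate(ps):
        # The empirical CDF jumps from i/n to (i+1)/n at p,
        # while the Uniform(0, 1) CDF at p is simply p.
        distance = max(distance, (i + 1) / n - p, p - i / n)
    return distance


def fraction_below(p_values, alpha=0.05):
    """Share of P values at or below alpha; about alpha if uniform."""
    return sum(p <= alpha for p in p_values) / len(p_values)


# Simulated data only (an assumption for illustration, not real output
# from any driver gene prediction method).
rng = random.Random(0)
calibrated = [rng.random() for _ in range(5000)]
# Hypothetical anti-conservative method: P values pushed toward zero.
inflated = [rng.random() ** 2 for _ in range(5000)]

for name, ps in [("calibrated", calibrated), ("inflated", inflated)]:
    print(name, ks_distance_from_uniform(ps), fraction_below(ps))
```

In this toy setting the inflated P values show a much larger distance from uniformity and far more values below 0.05 than the nominal 5%, which is the kind of excess that translates into false-positive driver calls. With real data, most genes would be assumed to be non-drivers, so a strong departure from uniformity across the bulk of genes would point to a miscalibrated null model rather than to true signal.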
