Functional Evolutionary Modeling Exposes Overlooked Protein-Coding Genes Involved in Cancer

Numerous computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Compiling a catalog of cancer genes has profound implications for the understanding and treatment of the disease. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the evolutionary selection of genes by assessing the functional effects of mutations on protein-coding genes using a pre-trained machine-learning model. The framework compares the estimated effects of observed genetic variations against those of all possible single-nucleotide mutations in the coding human genome. Compared to existing methods, FABRIC makes minimal assumptions about the distribution of random mutations. To demonstrate its wide applicability, we applied FABRIC to both naturally occurring human variants and somatic mutations in cancer. In the context of cancer, ~3 M somatic mutations were extracted from over 10,000 cancerous human samples. Of the entire human proteome, 593 protein-coding genes show a statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with contemporary cancer gene catalogs. Notably, the majority of these genes (426) are not listed in these catalogs, but a substantial fraction of them are supported by the literature. In the context of normal human evolution, we analyzed ~5 M common and rare variants from ~60 K individuals, discovering 6,288 significant genes. Over 98% of them are dominated by negative selection, supporting the notion of strong purifying selection during the evolution of the healthy human population. We present the FABRIC framework as an open-source project with a simple command-line interface.
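The comparison of observed mutations against all possible single-nucleotide mutations can be illustrated with a minimal sketch. This is not the FABRIC implementation: the function names, the three-level `toy_effect_score` (a stand-in for the pre-trained machine-learning model), the uniform background over all substitutions, and the z-score statistic are all assumptions made for illustration only.

```python
import itertools
import math
import statistics

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}


def toy_effect_score(ref_codon, alt_codon):
    """Hypothetical stand-in for a learned effect predictor (0 = harmless, 1 = harmful)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return 0.0  # synonymous
    if alt_aa == "*":
        return 1.0  # nonsense
    return 0.5      # missense


def all_snv_scores(cds):
    """Score every possible single-nucleotide substitution in a coding sequence."""
    scores = {}
    for pos, ref in enumerate(cds):
        offset = pos % 3
        ref_codon = cds[pos - offset:pos - offset + 3]
        for alt in BASES:
            if alt == ref:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            scores[(pos, alt)] = toy_effect_score(ref_codon, alt_codon)
    return scores


def selection_z_score(cds, observed):
    """Compare the mean score of observed mutations with the all-SNV background.

    Assumes every possible substitution is equally likely under no selection,
    which is a simplification adopted only for this sketch.
    """
    background = list(all_snv_scores(cds).values())
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)
    scores = all_snv_scores(cds)
    obs = [scores[m] for m in observed]
    return (statistics.fmean(obs) - mu) / (sigma / math.sqrt(len(obs)))


# Toy gene (Met-Trp-Lys) with mutations given as (0-based position, alternative base).
cds = "ATGTGGAAA"
z_harmful = selection_z_score(cds, [(5, "A")])  # TGG -> TGA, nonsense
z_benign = selection_z_score(cds, [(8, "G")])   # AAA -> AAG, synonymous
```

In this toy setting, a positive score indicates that the observed mutations are more damaging than expected from the background (as expected for genes under positive selection in tumors), and a negative score indicates a bias towards less damaging mutations, consistent with purifying selection.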
