A general framework for estimating the relative pathogenicity of human genetic variants

Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation–Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they rank known pathogenic variants highly within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.
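
The training setup described above (a support vector machine separating observed human-derived alleles from simulated variants on the basis of their annotations) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the two annotation columns are hypothetical stand-ins for the many annotations CADD integrates, the optimizer is plain stochastic subgradient descent on the L2-regularized hinge loss, and the sign convention (simulated = +1, so that higher scores mean "more like a simulated variant") is an assumption made for illustration.

```python
import random


def make_variants(n, mean, label, rng):
    """Draw n toy variants, each with two hypothetical annotation values."""
    return [([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)], label) for _ in range(n)]


def decision_value(w, b, x):
    """Linear SVM decision value w.x + b (a stand-in for a raw, unscaled score)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(data, lam=0.01, lr=0.01, epochs=20, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on hinge loss + L2 penalty."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    b = 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            violated = y * decision_value(w, b, x) < 1.0
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss subgradient: move the boundary away from this example.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


# Synthetic data (assumption): observed alleles are labeled -1, simulated variants +1.
rng = random.Random(42)
train = make_variants(200, -1.0, -1, rng) + make_variants(200, 1.0, 1, rng)
held_out = make_variants(200, -1.0, -1, rng) + make_variants(200, 1.0, 1, rng)

w, b = train_linear_svm(train)
accuracy = sum((decision_value(w, b, x) > 0) == (y > 0) for x, y in held_out) / len(held_out)
print(f"weights={w}, bias={b:.3f}, held-out accuracy={accuracy:.2f}")
```

In this toy setting, `decision_value` plays the role of a per-variant score: once the weights are fitted, any variant with the same annotation columns can be scored without retraining, which is what makes precomputing scores for every possible single-nucleotide variant feasible.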


[33]  Mary Goldman,et al.  The UCSC Genome Browser database: update 2011 , 2010, Nucleic Acids Res..

[34]  Adam C. Siepel,et al.  PHAST and RPHAST: phylogenetic analysis with space/time models , 2011, Briefings Bioinform..

[35]  J. Shendure,et al.  Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data , 2011, Nature Reviews Genetics.

[36]  A. Gonzalez-Perez,et al.  Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. , 2011, American journal of human genetics.

[37]  E. Boerwinkle,et al.  dbNSFP: A Lightweight Database of Human Nonsynonymous SNPs and Their Functional Predictions , 2011, Human mutation.

[38]  Gregory M. Cooper,et al.  A Copy Number Variation Morbidity Map of Developmental Delay , 2011, Nature Genetics.

[39]  M. Rieder,et al.  Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations , 2011, Nature Genetics.

[40]  D. Haussler,et al.  ENCODE whole-genome data in the UCSC Genome Browser: update 2012 , 2011, Nucleic Acids Res..

[41]  B. V. van Bon,et al.  Diagnostic exome sequencing in persons with severe intellectual disability. , 2012, The New England journal of medicine.

[42]  Kenny Q. Ye,et al.  De Novo Gene Disruptions in Children on the Autistic Spectrum , 2012, Neuron.

[43]  Michael F. Walker,et al.  De novo mutations revealed by whole-exome sequencing are strongly associated with autism , 2012, Nature.

[44]  Manolis Kellis,et al.  Interpreting noncoding genetic variation in complex traits and human disease , 2012, Nature Biotechnology.

[45]  D. Horn,et al.  Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study , 2012, The Lancet.

[46]  Jacob A. Tennessen,et al.  Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes , 2012, Science.

[47]  Bradley P. Coe,et al.  Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations , 2012, Nature.

[48]  David Z. Chen,et al.  Architecture of the human regulatory network derived from ENCODE data , 2012, Nature.

[49]  D. Reich,et al.  Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture , 2012, Genome research.

[50]  Joseph B Hiatt,et al.  Massively parallel functional dissection of mammalian enhancers in vivo , 2012, Nature Biotechnology.

[51]  Data production leads,et al.  An integrated encyclopedia of DNA elements in the human genome , 2012 .

[52]  Adrian W. Briggs,et al.  A High-Coverage Genome Sequence from an Archaic Denisovan Individual , 2012, Science.

[53]  S. Batzoglou,et al.  Linking disease associations with regulatory information in the human genome , 2012, Genome research.

[54]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[55]  ENCODEConsortium,et al.  An Integrated Encyclopedia of DNA Elements in the Human Genome , 2012, Nature.

[56]  Joseph K. Pickrell,et al.  A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes , 2012, Science.

[57]  William Stafford Noble,et al.  Unsupervised pattern discovery in human chromatin structure through genomic segmentation , 2012, Nature Methods.

[58]  Evan T. Geller,et al.  Patterns and rates of exonic de novo mutations in autism spectrum disorders , 2012, Nature.

[59]  Monya Baker,et al.  One-stop shop for disease genes , 2012, Nature.

[60]  Mary Goldman,et al.  The UCSC Genome Browser database: extensions and updates 2011 , 2011, Nucleic Acids Res..

[61]  Ilan Gronau,et al.  Genome-wide inference of natural selection on human transcription factor binding sites , 2013, Nature Genetics.

[62]  S. Gabriel,et al.  Analysis of 6,515 exomes reveals a recent origin of most human protein-coding variants , 2012, Nature.

[63]  Mary Goldman,et al.  The UCSC Genome Browser database: extensions and updates 2013 , 2012, Nucleic Acids Res..

[64]  Mark Gerstein,et al.  Interpretation of Genomic Variants Using a Unified Biological Network Approach , 2013, PLoS Comput. Biol..

[65]  A. Hoischen,et al.  MLL2 mutation detection in 86 patients with Kabuki syndrome: a genotype–phenotype study , 2013, Clinical genetics.

[66]  R. Reading,et al.  Diagnostic exome sequencing in persons with severe intellectual disability , 2013 .

[67]  Anna Murray,et al.  Recessive mutations in a distal PTF1A enhancer cause isolated pancreatic agenesis , 2013, Nature Genetics.