Prioritization of Nonsynonymous Single Nucleotide Variants for Exome Sequencing Studies via Integrative Learning on Multiple Genomic Data

The rapid advancement of next-generation sequencing technology has greatly accelerated progress in understanding human inherited diseases through innovations such as exome sequencing. Nevertheless, identifying causative variants from sequencing data remains a great challenge. Traditional statistical genetics approaches such as linkage analysis and association studies have limited power in analyzing exome sequencing data, while relying on simple filtration strategies and predicted functional implications of mutations to pinpoint pathogenic variants is prone to producing false positives. To overcome these limitations, we propose a supervised learning approach, termed snvForest, that prioritizes candidate nonsynonymous single nucleotide variants for a specific type of disease by integrating 11 functional scores at the variant level and 8 association scores at the gene level. We conduct a series of large-scale in silico validation experiments, demonstrating the effectiveness of snvForest across 2,511 diseases with different modes of inheritance and the superiority of our approach over two state-of-the-art methods. We further apply snvForest to three real exome sequencing data sets of epileptic encephalopathies and intellectual disability to show the ability of our approach to identify causative de novo mutations for these complex diseases. The online service and standalone software of snvForest are available at http://bioinfo.au.tsinghua.edu.cn/jianglab/snvforest.
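
As a rough illustration of the integration step described above, the following sketch concatenates 11 variant-level scores and 8 gene-level scores into one 19-dimensional feature vector, trains a small ensemble on labeled examples, and ranks candidate variants by the ensemble score. The abstract does not state which learner snvForest uses; the name suggests a random-forest-style ensemble, so the sketch assumes bagged decision stumps with random feature subsets as a simplified stand-in. All function names, parameters, and data here are hypothetical, and the training data are synthetic.

```python
import random

N_VARIANT_SCORES = 11  # functional scores at the variant level (from the abstract)
N_GENE_SCORES = 8      # association scores at the gene level (from the abstract)


def make_features(variant_scores, gene_scores):
    """Concatenate variant-level and gene-level scores into one 19-dim vector."""
    if len(variant_scores) != N_VARIANT_SCORES or len(gene_scores) != N_GENE_SCORES:
        raise ValueError("expected 11 variant-level and 8 gene-level scores")
    return list(variant_scores) + list(gene_scores)


def fit_stump(X, y, rng):
    """Fit one decision stump on a bootstrap sample using a random feature subset."""
    n, d = len(X), len(X[0])
    rows = [rng.randrange(n) for _ in range(n)]           # bootstrap sample
    feats = rng.sample(range(d), max(1, int(d ** 0.5)))   # random feature subset
    best = None  # (weighted Gini impurity, feature, threshold, p_left, p_right)
    for f in feats:
        for t in sorted({X[i][f] for i in rows}):
            left = [y[i] for i in rows if X[i][f] < t]
            right = [y[i] for i in rows if X[i][f] >= t]
            if not left or not right:
                continue
            p_l, p_r = sum(left) / len(left), sum(right) / len(right)
            gini = len(left) * p_l * (1 - p_l) + len(right) * p_r * (1 - p_r)
            if best is None or gini < best[0]:
                best = (gini, f, t, p_l, p_r)
    if best is None:  # no valid split found: fall back to the class prior
        p = sum(y[i] for i in rows) / n
        return (0, float("-inf"), p, p)
    return best[1:]


def fit_forest(X, y, n_trees=50, seed=0):
    """Train an ensemble of stumps (assumed stand-in for the actual learner)."""
    rng = random.Random(seed)
    return [fit_stump(X, y, rng) for _ in range(n_trees)]


def score(forest, x):
    """Average the per-stump estimates that x is a causative variant."""
    votes = [(p_r if x[f] >= t else p_l) for f, t, p_l, p_r in forest]
    return sum(votes) / len(votes)


def prioritize(forest, candidates):
    """Return candidate names sorted from highest to lowest ensemble score."""
    return sorted(candidates, key=lambda name: score(forest, candidates[name]), reverse=True)


# Synthetic training data (not real scores): causative variants get high
# scores on every feature, neutral variants get low scores.
data_rng = random.Random(42)
X_train, y_train = [], []
for label, (lo, hi) in ((1, (0.6, 0.8)), (0, (0.2, 0.4))):
    for _ in range(20):
        variant = [data_rng.uniform(lo, hi) for _ in range(N_VARIANT_SCORES)]
        gene = [data_rng.uniform(lo, hi) for _ in range(N_GENE_SCORES)]
        X_train.append(make_features(variant, gene))
        y_train.append(label)

forest = fit_forest(X_train, y_train)
candidates = {
    "candidate_low": make_features([0.1] * N_VARIANT_SCORES, [0.1] * N_GENE_SCORES),
    "candidate_high": make_features([0.9] * N_VARIANT_SCORES, [0.9] * N_GENE_SCORES),
}
ranking = prioritize(forest, candidates)
print(ranking)
```

In practice the feature vector would hold real functional and association scores for each candidate variant and its host gene, and a full tree ensemble would replace the stumps; the ranking step stays the same.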
