Integrating Multiple Genomic Data to Predict Disease-Causing Nonsynonymous Single Nucleotide Variants in Exome Sequencing Studies

Exome sequencing has been widely used to detect pathogenic nonsynonymous single nucleotide variants (SNVs) in human inherited diseases. However, traditional statistical genetics methods are ineffective for analyzing exome sequencing data because of the large number of sequenced variants, the non-negligible fraction of pathogenic rare variants and de novo mutations, and the limited size of affected and normal populations. The widespread use of exome sequencing therefore calls for an effective computational method that identifies causative nonsynonymous SNVs among a large number of sequenced variants. Here, we propose a bioinformatics approach called SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein-protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING assesses the statistical significance of an SNV being causative for a query disease and thereby provides a means of prioritizing candidate SNVs. Through a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose genetic bases are either partly known or completely unknown, and effective for diseases with a variety of inheritance modes. In applications of our method to real exome sequencing data sets, we show that SPRING can detect causative de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua.edu.cn/spring.
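To make the idea of integrating several scores into one significance value concrete, the following is a minimal sketch and not SPRING's actual procedure, which the abstract does not specify. It assumes that each of the eleven scores has already been converted to a p-value, that the p-values are independent, and that they are combined with Fisher's method; the function names and the variant labels are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine p-values with Fisher's method (independence is assumed).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom its survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    # Clip to avoid log(0); the lower bound is an arbitrary tiny constant.
    clipped = [min(max(p, 1e-300), 1.0) for p in p_values]
    half_stat = -sum(math.log(p) for p in clipped)  # this is X / 2
    k = len(clipped)
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def prioritize(candidates):
    """Rank candidate variants by combined p-value, most significant first."""
    scored = [(name, fisher_combine(ps)) for name, ps in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical candidates: six functional-effect p-values followed by
# five association p-values per variant (values invented for illustration).
candidates = {
    "variant_A": [0.01] * 6 + [0.02] * 5,
    "variant_B": [0.50] * 6 + [0.60] * 5,
    "variant_C": [0.01] * 6 + [0.70] * 5,
}
ranking = prioritize(candidates)
for name, p in ranking:
    print(f"{name}\t{p:.3e}")
```

In this toy example a variant ranks highly only when both the functional and the association evidence are strong. If the individual scores are correlated, the independence assumption fails and the combined p-value would need a correction.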
