A gradient-boosting approach for filtering de novo mutations in parent-offspring trios

MOTIVATION Whole-genome and whole-exome sequencing of parent-offspring trios is a powerful approach for identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually have high error rates owing to sequencing artifacts and alignment issues: they may miss true de novo mutations or call too many false ones, which makes downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge.

RESULTS We curated 59 sequence features from the whole-genome and exome alignment context that are considered relevant to discriminating true de novo mutations from artifacts, and then used a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set from experimentally validated true and false de novo mutations, together with false de novo mutations collected from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated the relative importance of different sequence features for classification accuracy. Finally, we applied DNMFilter to our in-house whole-exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter can be coupled with commonly used de novo mutation detection approaches as an effective filter that significantly reduces the false discovery rate without sacrificing sensitivity.

AVAILABILITY DNMFilter, implemented in a combination of Java and R, is freely available at http://humangenome.duke.edu/software.
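To make the idea of a gradient-boosting filter concrete, the following is a minimal sketch in plain Python. It is not DNMFilter's implementation (which the abstract describes as Java and R) and does not use its 59 features, which are not listed here. The three feature names (child allele balance, parental alternate-read count, mapping quality), the synthetic data, and all parameter values are assumptions chosen purely for illustration. The sketch boosts one-split regression stumps on the negative gradient of the logistic loss, a simplified variant of gradient boosting.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) for the one-split
    regression stump with the lowest squared error against the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lv, rv)
    return best[1:]


def train(X, y, rounds=20, learning_rate=0.3):
    """Boost stumps on the negative gradient of the logistic loss, y - p."""
    pos = sum(y) / len(y)
    base = math.log(pos / (1.0 - pos))  # initial score: log-odds of the positive class
    scores = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lv, rv))
        scores = [s + learning_rate * (lv if row[j] <= t else rv)
                  for row, s in zip(X, scores)]
    return base, learning_rate, stumps


def predict_proba(model, row):
    """Estimated probability that a candidate is a true de novo mutation."""
    base, lr, stumps = model
    score = base + sum(lr * (lv if row[j] <= t else rv) for j, t, lv, rv in stumps)
    return sigmoid(score)


def make_candidates(n, rng):
    """Synthetic candidates with hypothetical features:
    [child allele balance, parental alt-read count, mapping quality]."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = true de novo mutation, 0 = artifact
        if label:
            row = [rng.gauss(0.5, 0.08), rng.choice([0, 0, 0, 1]), rng.gauss(55, 5)]
        else:
            row = [rng.gauss(0.2, 0.08), rng.choice([0, 1, 2, 3]), rng.gauss(45, 10)]
        X.append(row)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_candidates(200, rng)
X_test, y_test = make_candidates(100, rng)

model = train(X_train, y_train)
probs = [predict_proba(model, row) for row in X_test]
accuracy = sum((p >= 0.5) == bool(t) for p, t in zip(probs, y_test)) / len(y_test)
print(f"held-out accuracy on synthetic candidates: {accuracy:.2f}")
```

In a filtering setting, the predicted probability would be thresholded on candidates emitted by an upstream de novo caller. The 0.5 cutoff used here is arbitrary, and a real filter would tune it to trade sensitivity against the false discovery rate.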
