Prediction of Rare Single-Nucleotide Causative Mutations for Muscular Diseases in Pooled Next-Generation Sequencing Experiments

Next-generation sequencing (NGS) is a new approach for biomedical research, useful for diagnosing genetic diseases in extremely heterogeneous conditions. In this work, we describe how data generated by high-throughput NGS experiments can be analyzed to find single-nucleotide polymorphisms (SNPs) in DNA samples of patients affected by neuromuscular disorders. In particular, we consider untagged pooled NGS data, where DNA samples from different individuals are combined in a single experiment while still providing information with an uncertainty limited to only two patients. At present, only a few publications address the problem of SNP detection in pooled experiments, and existing tools are often inaccurate. We propose a computational procedure consisting of two phases. In the first phase, data are filtered by means of decision rules. The second phase is based on a supervised classification technique. We compare different de facto standard supervised and unsupervised procedures for identifying and classifying variants potentially related to muscular diseases, and we discuss the results in terms of statistical and biological validation.
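The two-phase procedure can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the decision rules, the features, or the classifier, so the `Candidate` fields, the thresholds in `passes_rules`, the feature scaling, and the choice of a k-nearest-neighbour vote are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist


@dataclass
class Candidate:
    """One candidate variant position in a pool (all fields are hypothetical)."""
    depth: int        # reads covering the position
    alt_reads: int    # reads supporting the alternate allele
    mean_qual: float  # mean base quality of the alternate-allele reads

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads carrying the alternate allele; 0 if uncovered.
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_rules(c, min_depth=20, min_alt_reads=3, min_qual=20.0):
    """Phase 1: rule-based filter. Thresholds are illustrative assumptions."""
    return (c.depth >= min_depth
            and c.alt_reads >= min_alt_reads
            and c.mean_qual >= min_qual)


def features(c):
    # Map a candidate to a small numeric vector; quality is scaled by an
    # assumed maximum of 40 so both features lie in a comparable range.
    return (c.alt_fraction, c.mean_qual / 40.0)


def knn_predict(train, query, k=3):
    """Phase 2: supervised step, here a plain k-nearest-neighbour majority vote.

    `train` is a list of (Candidate, label) pairs with known labels.
    """
    nearest = sorted(
        train, key=lambda item: dist(features(item[0]), features(query))
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def call_variants(train, candidates, k=3):
    """Apply the filter, then classify only the candidates that survive it."""
    return [(c, knn_predict(train, c, k)) for c in candidates if passes_rules(c)]
```

Filtering first keeps obviously unreliable positions away from the classifier, so the supervised step only has to separate the harder cases. Any classifier trained on labelled candidates could replace `knn_predict` in this sketch.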
