InDel marker detection by integration of multiple softwares using machine learning techniques

Background: In biological experiments on soybean, molecular markers are widely used to verify the soybean genome or to construct its genetic map. Among the many kinds of molecular markers, insertions and deletions (InDels) are preferred because they are widely distributed and occur at high density across the whole genome. Detecting InDels from next-generation sequencing data is therefore of great importance for the design of InDel markers. To address this problem, this paper integrates machine learning techniques with existing software and develops two algorithms for InDel detection: the best F-score method (BF-M) and the Support Vector Machine method (SVM-M), which is based on the classical SVM model.

Results: The experimental results show that BF-M performed well, as indicated by its high precision and recall scores, whereas SVM-M yielded the best performance in terms of recall and F-score. Moreover, from the InDel markers detected by SVM-M in soybeans collected from 56 different regions, highly polymorphic loci were selected to construct an InDel marker database for soybean.

Conclusions: Compared to existing software tools, the two algorithms proposed in this work produced substantially higher precision and recall scores and remained stable across various types of genomic regions. Moreover, based on SVM-M, we have constructed a database of soybean InDel markers and made it available for academic research.
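The abstract evaluates callers by precision, recall, and F-score but does not describe how BF-M combines the individual tools. As an illustration only, the sketch below assumes one plausible reading: each tool reports a set of InDel calls, and the combination rule (which tools to use, and how many of them must agree) is chosen to maximize the F-score against a truth set. The tool names, the tuple encoding of an InDel, and the function `best_f_score_rule` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def prf(called, truth):
    """Return (precision, recall, F-score) of a call set against a truth set."""
    tp = len(called & truth)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def best_f_score_rule(tool_calls, truth):
    """Search every non-empty subset of tools and every vote threshold k.

    A call is kept when at least k tools in the subset report it.
    Returns (f_score, subset, k, precision, recall) for the best rule.
    This is a hypothetical selection rule, not the paper's BF-M.
    """
    best = None
    names = sorted(tool_calls)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # Count how many tools in this subset support each call.
            votes = {}
            for name in subset:
                for indel in tool_calls[name]:
                    votes[indel] = votes.get(indel, 0) + 1
            for k in range(1, size + 1):
                called = {indel for indel, n in votes.items() if n >= k}
                precision, recall, f_score = prf(called, truth)
                if best is None or f_score > best[0]:
                    best = (f_score, subset, k, precision, recall)
    return best


# Toy data: an InDel is encoded as (chromosome, position, type, length).
A = ("Chr01", 1200, "DEL", 3)
B = ("Chr01", 5400, "INS", 2)
C = ("Chr02", 800, "DEL", 7)
D = ("Chr03", 9100, "INS", 1)
X = ("Chr01", 300, "DEL", 1)   # false positive reported by tool_a only
Y = ("Chr02", 4400, "INS", 4)  # false positive reported by tool_b only
Z = ("Chr03", 150, "DEL", 2)   # false positive reported by tool_c only

truth = {A, B, C, D}
tool_calls = {
    "tool_a": {A, B, C, X},
    "tool_b": {A, B, D, Y},
    "tool_c": {A, C, D, Z},
}

best = best_f_score_rule(tool_calls, truth)
print(best)
```

On this toy input, requiring agreement between at least two of the three tools removes the tool-specific false positives while keeping every true InDel, which is the intuition behind integrating several callers rather than trusting one.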
