SICaRiO: short indel call filtering with boosting.

Despite impressive improvements in next-generation sequencing technology, reliable detection of indels remains difficult. Recognition of true indels is of prime importance in many applications, such as personalized health care, disease genomics and population genetics. Recently, advanced machine learning techniques have been successfully applied to classification problems with large-scale data. In this paper, we present SICaRiO, a gradient boosting classifier for the reliable detection of true indels, trained on the gold-standard dataset from the 'Genome in a Bottle' (GIAB) consortium. Our filtering scheme significantly improves the performance of each variant calling pipeline used in GIAB and beyond. SICaRiO uses genomic features that can be computed from publicly available resources, i.e. it does not require sequencing pipeline-specific information (e.g. read depth). This study also sheds light on the prior genomic contexts responsible for erroneous indel calls made by sequencing pipelines. We compare prediction difficulty for three categories of indels across different sequencing pipelines, and we rank genomic features by how well they predict false positives.
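To make the idea of filtering indel calls with gradient boosting concrete, the following is a minimal sketch in plain Python, not the SICaRiO implementation. It boosts decision stumps on the log-loss gradient to separate true indels from false positives. The two features (homopolymer run length and GC fraction), the toy data, and all function names are assumptions chosen for illustration only; the abstract does not list the actual feature set or hyperparameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the single-feature threshold split that best fits the residuals
    in the least-squares sense. Assumes at least one feature takes two or
    more distinct values."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for binary labels with log loss.
    Each round fits a stump to the pseudo-residuals y - p."""
    pos = sum(y) / len(y)
    f0 = math.log(pos / (1.0 - pos))  # initial log-odds; needs both classes
    scores = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return f0, learning_rate, stumps


def predict_proba(model, row):
    """Estimated probability that a candidate call is a true indel."""
    f0, learning_rate, stumps = model
    score = f0 + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return sigmoid(score)


# Hypothetical toy data: [homopolymer_run_length, gc_fraction] per candidate.
# Label 1 = true indel, 0 = false positive call.
X = [[1, 0.45], [2, 0.50], [1, 0.40], [3, 0.55],
     [8, 0.30], [10, 0.65], [7, 0.35], [9, 0.60]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_gradient_boosting(X, y)

# Keep only the candidate calls the model scores as likely true indels.
candidates = [[2, 0.48], [9, 0.52]]
kept = [c for c in candidates if predict_proba(model, c) >= 0.5]
```

In this sketch the filter is applied after variant calling and sees only sequence-context features, which mirrors the abstract's point that no pipeline-specific information such as read depth is required. A production system would more plausibly rely on an established gradient boosting library and a far larger labelled call set.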
