Concod: Accurate consensus-based approach of calling deletions from high-throughput sequencing data

Accurately calling structural variations such as deletions from the short reads produced by high-throughput sequencing is an important but challenging problem in genome analysis. Many methods for calling deletions exist, but at present no single method clearly outperforms all others in both precision and sensitivity. A popular strategy, used by several authors, is to combine the different signatures left by deletions in order to call them more accurately. However, most existing methods that take this combining approach are heuristic, and their output still contains many false calls. In this paper, we present Concod, a machine learning based framework for consensus-based deletion calling that more accurately distinguishes true deletions from falsely called ones. First, Concod collects candidate deletions by merging the output of multiple existing deletion calling tools. Then, features of each candidate are extracted from the aligned reads based on multiple detection theories. Finally, a machine learning model is trained on these features and used to classify candidates as true or false. We test our approach on real data at different coverage levels and compare it with existing tools, including Pindel, SVseq2, BreakDancer, and DELLY. Results show that Concod significantly improves both the precision and the sensitivity of deletion calling.
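
The first step, collecting candidates by merging the output of several callers, can be illustrated with a minimal Python sketch. It is not Concod's actual implementation: the `Deletion` record, the `merge_candidates` function, the reciprocal-overlap criterion, and the 0.5 threshold are all assumptions made for illustration, since the abstract does not specify how calls are merged.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    """A hypothetical record for one deletion call from one tool."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    caller: str


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Fraction of overlap relative to the longer of the two intervals."""
    if a.chrom != b.chrom:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return min(inter / (a.end - a.start), inter / (b.end - b.start))


def merge_candidates(calls, min_overlap=0.5):
    """Greedily cluster calls that overlap the first call of a cluster.

    The threshold is an assumed value, not one taken from the paper.
    """
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            # No existing cluster matched, so start a new candidate.
            clusters.append([call])

    # Summarize each cluster as one candidate with its supporting callers.
    return [
        {
            "chrom": cluster[0].chrom,
            "start": min(c.start for c in cluster),
            "end": max(c.end for c in cluster),
            "callers": sorted({c.caller for c in cluster}),
        }
        for cluster in clusters
    ]
```

Each merged candidate keeps the set of tools that support it. In a pipeline of the kind described above, that set could serve as one input alongside read-level features when training the classifier, although which features Concod actually uses is not stated here.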
