DeepSV: accurate calling of genomic deletions from high-throughput sequencing data using deep convolutional neural network

Background Calling genetic variations from sequence reads is an important problem in genomics, and many methods exist for calling various types of variations. Recently, Google developed a deep-learning method for calling single nucleotide polymorphisms (SNPs). Their method visualizes sequence reads in the form of images. These images are then used to train a deep neural network model, which is used to call SNPs. This raises a research question: can deep learning be used to call more complex genetic variations, such as structural variations (SVs), from sequence data?

Results In this paper, we extend this high-level approach to the problem of calling structural variations. We present DeepSV, an approach based on deep learning for calling long deletions from sequence reads. DeepSV is based on a novel method of visualizing sequence reads. The visualization is designed to capture multiple sources of information in the sequence data that are relevant to long deletions. DeepSV also implements techniques for working with noisy training data. DeepSV trains a model from the visualized sequence reads and calls deletions based on this model. We demonstrate that DeepSV outperforms existing methods in terms of accuracy and efficiency of deletion calling on data from the 1000 Genomes Project.

Conclusions Our work shows that deep learning can potentially lead to effective calling of different types of genetic variations that are more complex than SNPs.

Availability and implementation DeepSV's source code and sample results from this project are available on GitHub at https://github.com/CSuperlei/DeepSV/.
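To make the idea of visualizing sequence reads as images concrete, the following is a minimal sketch of one way aligned reads over a reference window could be turned into a grid of RGB pixels. It is an illustration only, not DeepSV's actual encoding: the function name `reads_to_image`, the colour map, the `(start, aligned_sequence)` read representation, and the use of "-" for a deleted reference base are all assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]

# Hypothetical colour map (one RGB triple per base); not taken from DeepSV.
BASE_COLOURS = {
    "A": (255, 0, 0),
    "C": (0, 255, 0),
    "G": (0, 0, 255),
    "T": (255, 255, 0),
}
GAP = (0, 0, 0)          # "-" (deleted reference base) or any unknown symbol
EMPTY = (255, 255, 255)  # window position not covered by the read


def reads_to_image(
    reads: List[Tuple[int, str]], window_start: int, window_len: int
) -> List[List[Pixel]]:
    """Render aligned reads as rows of RGB pixels over a reference window.

    Each read is (alignment_start, aligned_sequence); "-" in the sequence
    marks a reference position that is missing from the read.
    """
    image = []
    for start, seq in reads:
        row = [EMPTY] * window_len
        for offset, base in enumerate(seq):
            col = start + offset - window_start
            if 0 <= col < window_len:  # clip the read to the window
                row[col] = BASE_COLOURS.get(base, GAP)
        image.append(row)
    return image


# Two toy reads that both span a 2-base deletion at positions 104-105.
reads = [(100, "ACGT--AC"), (102, "GT--ACGT")]
image = reads_to_image(reads, window_start=100, window_len=10)
```

In this toy encoding, a deletion shared by several reads appears as a vertical band of gap-coloured pixels, which is the kind of spatial pattern a convolutional network can learn to recognize. A real encoding would likely add further channels or rows for other evidence relevant to deletions; the abstract does not specify which signals DeepSV uses, so none are assumed here.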
