BreakNet: detecting deletions using long reads and a deep learning approach

Background: Structural variations (SVs) account for a large share of human genetic diversity, and deletions are an important type of SV that has been associated with genetic diseases. Although various deletion calling methods based on long reads have been proposed, new approaches are still needed to mine features from long-read alignment information. Recently, deep learning has attracted much attention in genome analysis, and it is a promising technique for calling SVs.

Results: In this paper, we propose BreakNet, a deep learning method that detects deletions by using long reads. BreakNet first extracts feature matrices from long-read alignments. Second, it uses a time-distributed convolutional neural network (CNN) to integrate the feature matrices and map them to feature vectors. Third, BreakNet employs a bidirectional long short-term memory (BLSTM) model to analyse the resulting sequence of feature vectors in both the forward and backward directions. Finally, a classification module determines whether a region contains a deletion. On real long-read sequencing datasets, we demonstrate that BreakNet outperforms Sniffles, SVIM and cuteSV in terms of F1 score. The source code for the proposed method is available from GitHub at https://github.com/luojunwei/BreakNet.

Conclusions: Our work shows that deep learning can be combined with long reads to call deletions more effectively than existing methods.
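The feature-extraction step can be illustrated with a minimal sketch. The abstract does not specify which features BreakNet records, so the two per-position counts used here (aligned bases and deleted bases, read from CIGAR strings) are an assumption, and the function names `deletion_features` and `split_windows` are hypothetical rather than part of the BreakNet code base. The sketch builds one small matrix per sub-window, which is the kind of ordered input a time-distributed CNN followed by a BLSTM would consume.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def deletion_features(reads, start, end):
    """Build one [aligned, deleted] row per reference position in [start, end).

    reads is a list of (ref_start, cigar) pairs with 0-based start positions.
    The choice of these two features is an assumption made for illustration.
    """
    rows = [[0, 0] for _ in range(end - start)]
    for ref_start, cigar in reads:
        pos = ref_start
        for length_text, op in CIGAR_OP.findall(cigar):
            length = int(length_text)
            if op in "M=X":
                column = 0      # base aligned to the reference
            elif op == "D":
                column = 1      # base deleted from the read
            else:
                column = None   # not counted as a feature
            if op in "MDN=X":   # operations that consume the reference
                if column is not None:
                    lo, hi = max(pos, start), min(pos + length, end)
                    for p in range(lo, hi):
                        rows[p - start][column] += 1
                pos += length
    return rows


def split_windows(rows, size):
    """Cut the per-position rows into consecutive sub-windows of `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


if __name__ == "__main__":
    # Two toy alignments; the first carries a 3 bp deletion at 105-107.
    toy_reads = [(100, "5M3D4M"), (102, "6M")]
    matrix = deletion_features(toy_reads, 100, 112)
    for window in split_windows(matrix, 4):
        print(window)
```

Each sub-window plays the role of one time step: a CNN applied to every sub-window would yield one feature vector per step, and a recurrent model could then read those vectors in both directions before classification.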
