Enhancing breakpoint resolution with deep segmentation model: A general refinement method for read-depth based structural variant callers

Read depths (RDs) are frequently used to identify structural variants (SVs) from sequencing data. Existing RD-based SV callers struggle to determine breakpoints at single-nucleotide resolution because RD data are noisy and are calculated in bins. In this paper, we propose using the deep segmentation model UNet to learn base-wise RD patterns surrounding the breakpoints of known SVs. We integrate the model's predictions with an RD-based SV caller to refine breakpoints to single-nucleotide resolution. We show that UNet can be trained with a small amount of data and can be applied both in-sample and cross-sample. The resulting enhancement pipeline, RDBKE, significantly increases the number of SVs with more precise breakpoints on both simulated and real data. The source code of RDBKE is freely available at https://github.com/yaozhong/deepIntraSV.
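
To make the idea of base-wise segmentation concrete, the following is a minimal sketch of how per-base model outputs could be turned into breakpoint coordinates and scored against a reference labelling. It is not the RDBKE implementation: the function names, the 0.5 threshold, and the use of the Dice coefficient as the overlap measure are assumptions made for illustration only.

```python
def breakpoints_from_mask(probs, window_start, threshold=0.5):
    """Return coordinates where the thresholded per-base label changes.

    probs        -- hypothetical per-base probabilities of "inside an SV"
    window_start -- genomic coordinate of the first base in the window
    threshold    -- assumed cut-off for calling a base "inside"
    """
    labels = [p >= threshold for p in probs]
    # A breakpoint candidate is the first base whose label differs
    # from the label of the base immediately before it.
    return [
        window_start + i
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]


def dice(pred, truth):
    """Dice coefficient between two equal-length binary per-base masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


if __name__ == "__main__":
    # Toy window of six bases starting at coordinate 1000.
    probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3]
    print(breakpoints_from_mask(probs, window_start=1000))
    print(dice([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]))
```

Because every base receives its own label, the label transitions locate a breakpoint to a single nucleotide, whereas a bin-based RD caller can only place it within one bin width.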
