SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data

MOTIVATION: Structural variation (SV), such as deletion, is an important type of genetic variation and may be associated with disease. Although many methods exist for detecting SVs, finding deletions remains challenging with low-coverage short sequence reads. Existing deletion-finding methods for sequence reads either use split-read mapping to detect deletions with exact breakpoints, or rely on discordant insert sizes to estimate the approximate positions of deletions. Neither is completely satisfactory with low-coverage sequence reads.

RESULTS: We present SVseq, an efficient two-stage approach that combines split-read mapping with discordant insert size analysis. The first stage is split-read mapping based on the Burrows-Wheeler transform (BWT), which finds candidate deletions. Our split-read mapping method allows mismatches and small indels, so deletions near other small variations can be discovered and reads with sequencing errors can still be used. The second stage filters false positives by analyzing discordant insert sizes. SVseq is more accurate than an alternative approach when applied to simulated and empirical data, and is also much faster.

AVAILABILITY: The program SVseq can be downloaded at http://www.engr.uconn.edu/~jiz08001/

CONTACT: jinzhang@engr.uconn.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
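The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not SVseq's implementation: exact substring search (`str.find`) stands in for BWT-based mapping, so mismatches and small indels are not handled, and all names (`Deletion`, `split_read_candidates`, `supported`) and parameters (`min_anchor`, `max_del`, `k`, `min_pairs`) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deletion:
    start: int  # first deleted reference position (0-based)
    end: int    # first reference position after the deleted segment


def split_read_candidates(ref, read, min_anchor=8, max_del=1000):
    """Stage 1 (simplified): split the read into a prefix and a suffix and
    look for placements where both map exactly to the reference with a gap
    between them. Each such placement implies a candidate deletion."""
    candidates = set()
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:cut], read[cut:]
        p = ref.find(prefix)
        while p != -1:
            left_end = p + cut  # reference position just past the prefix
            s = ref.find(suffix, left_end + 1)
            while s != -1 and s - left_end <= max_del:
                candidates.add(Deletion(left_end, s))
                s = ref.find(suffix, s + 1)
            p = ref.find(prefix, p + 1)
    return candidates


def supported(deletion, pairs, mean_insert, sd_insert, min_pairs=1, k=3):
    """Stage 2 (simplified): keep a candidate only if enough read pairs span
    it with a mapped distance that exceeds the library's mean insert size by
    roughly the deletion length. `pairs` holds (leftmost, rightmost) mapped
    reference coordinates of each read pair."""
    length = deletion.end - deletion.start
    count = 0
    for left, right in pairs:
        spans = left < deletion.start and right > deletion.end
        excess = (right - left) - mean_insert
        if spans and abs(excess - length) <= k * sd_insert:
            count += 1
    return count >= min_pairs
```

In this sketch, a candidate from stage 1 is reported only if `supported` returns true for it. Because repeated bases at a deletion boundary can make several split points equally valid, stage 1 may return more than one equivalent candidate for the same event.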
