Sprites: detection of deletions from sequencing data by re-aligning split reads

MOTIVATION Advances in next-generation sequencing technologies and the availability of short-read data enable the detection of structural variations (SVs). Deletions, an important type of SV, have been suggested to be associated with genetic diseases. There are three types of deletions: blunt deletions, deletions with microhomologies and deletions with microinsertions. The last two types are very common in the human genome, but they are difficult to detect. Finding deletions from sequencing data therefore remains challenging, and sensitive and accurate detection methods are needed, especially for deletions with microhomologies and deletions with microinsertions.

RESULTS We present a novel method called Sprites (SPlit Read re-alIgnment To dEtect Structural variants), which finds deletions from sequencing data. It aligns a whole soft-clipped read, rather than only its clipped part, to the target sequence (a segment of the reference determined by spanning reads) in order to find the longest prefix or suffix of the read that has a match in the target sequence. This alignment is designed to handle deletions with microhomologies and deletions with microinsertions. Using both simulated and real data, we show that Sprites achieves a higher F-score in detecting deletions than other current methods.

AVAILABILITY AND IMPLEMENTATION Sprites is open source software and freely available at https://github.com/zhangzhen/sprites

CONTACT jxwang@mail.csu.edu.cn

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
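The core idea in RESULTS, finding the longest prefix or suffix of a whole read that matches somewhere in the target sequence, can be illustrated with a minimal Python sketch. This is not the Sprites implementation: the function names are hypothetical, and exact string matching is an assumption made for brevity (a real aligner would typically tolerate mismatches).

```python
def longest_suffix_match(read, target):
    """Return (length, position) of the longest suffix of `read` that
    occurs exactly in `target`, or (0, -1) if no suffix matches.
    Assumption: exact matching only, no mismatches or gaps."""
    for start in range(len(read)):
        suffix = read[start:]
        pos = target.find(suffix)
        if pos != -1:
            return len(suffix), pos
    return 0, -1


def longest_prefix_match(read, target):
    """Return (length, position) of the longest prefix of `read` that
    occurs exactly in `target`, or (0, -1) if no prefix matches."""
    for end in range(len(read), 0, -1):
        prefix = read[:end]
        pos = target.find(prefix)
        if pos != -1:
            return len(prefix), pos
    return 0, -1


if __name__ == "__main__":
    # Toy read and target segments; the sequences are invented for illustration.
    print(longest_suffix_match("AAACCCGGG", "TTCCCGGGTT"))
    print(longest_prefix_match("AAACCCGGG", "TTAAACCTT"))
```

Because the whole read is searched rather than only its clipped part, bases that the original aligner assigned to the wrong side of a breakpoint (as can happen with microhomology) are still free to extend the match in this sketch.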
