Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone

MOTIVATION: The landscape of structural variation (SV), including complex duplication and translocation patterns, is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation, and struggle to correctly classify the type and exact size of SVs.

RESULTS: We present Gustaf (Generic mUlti-SpliT Alignment Finder), a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of ≥30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. We show that Gustaf correctly identifies SVs, especially in the range from 30 to 100 bp, which we call the next-generation sequencing (NGS) twilight zone of SVs, as well as larger SVs >500 bp. Gustaf performs better than similar tools in our benchmark and is furthermore able to correctly identify the size and location of dispersed duplications and translocations, which might otherwise be wrongly classified, for example, as large deletions.
