Repeat- and error-aware comparison of

Motivation: The number of reported genetic variants is growing rapidly, driven by the ever faster accumulation of next-generation sequencing data. A major issue is comparability. Standards that address the combined problem of inaccurately predicted breakpoints and repeat-induced ambiguities are missing. This lowers the quality of 'consensus' callsets and hampers the removal of duplicate entries in variant databases, which can have deleterious effects in downstream analyses. Results: We introduce a sound framework for the comparison of deletions that captures both tool-induced inaccuracies and repeat-induced ambiguities. We present a maximum matching algorithm that outputs virtual duplicates between two sets of predictions/annotations. We demonstrate that our approach is clearly superior to ad hoc criteria, such as overlap, and that it can substantially reduce the redundancy among callsets. We also identify a large number of duplicate entries in the Database of Genomic Variants, which underscores the immediate relevance of our approach. Availability and implementation: Implementation is open source and available from https://
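To make the two ideas in the Results concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified similarity criterion: two deletions count as virtual duplicates if, after shifting each as far left as the reference allows without changing the resulting sequence, their start positions and lengths differ by at most a tolerance. The names `left_normalize`, `similar`, `maximum_matching`, `max_shift` and `max_len_diff` are hypothetical, and the matching is computed with a standard augmenting-path method for bipartite graphs.

```python
from typing import Dict, List, Set, Tuple

Deletion = Tuple[int, int]  # (0-based start, length) on the reference; illustrative type


def left_normalize(ref: str, start: int, length: int) -> int:
    """Shift a deletion left as far as possible without changing the outcome.

    Deleting ref[start:start+length] yields the same sequence as deleting
    ref[start-1:start-1+length] exactly when ref[start-1] == ref[start+length-1].
    """
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def similar(ref: str, a: Deletion, b: Deletion,
            max_shift: int = 0, max_len_diff: int = 0) -> bool:
    """Assumed criterion: close after left-normalization (tolerances are hypothetical)."""
    sa = left_normalize(ref, a[0], a[1])
    sb = left_normalize(ref, b[0], b[1])
    return abs(sa - sb) <= max_shift and abs(a[1] - b[1]) <= max_len_diff


def maximum_matching(ref: str, calls_a: List[Deletion], calls_b: List[Deletion],
                     max_shift: int = 0, max_len_diff: int = 0) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of a maximum matching between the two callsets."""
    # Bipartite graph: an edge wherever two calls satisfy the similarity criterion.
    adj = [[j for j, b in enumerate(calls_b)
            if similar(ref, a, b, max_shift, max_len_diff)]
           for a in calls_a]
    match_b: Dict[int, int] = {}  # index in calls_b -> matched index in calls_a

    def try_augment(i: int, seen: Set[int]) -> bool:
        # Search for an augmenting path starting at call i of the first set.
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_b or try_augment(match_b[j], seen):
                match_b[j] = i
                return True
        return False

    for i in range(len(calls_a)):
        try_augment(i, set())
    return sorted((i, j) for j, i in match_b.items())


if __name__ == "__main__":
    ref = "GGACACACTT"
    calls_a = [(2, 2), (9, 1)]
    calls_b = [(8, 1), (6, 2), (0, 1)]
    print(maximum_matching(ref, calls_a, calls_b))
```

In this toy reference, deleting `AC` at position 2 or at position 6 produces the same sequence, so the two calls are paired even though their coordinates do not overlap at all. This is the kind of case that an overlap criterion would miss.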
