Hybrid de novo tandem repeat detection using short and long reads

Background: Tandem repeats are among the most studied genome rearrangements and have a considerable impact on the genetic background of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high-quality results. However, in a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads produced by second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation is addressed by the long reads produced by third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain in read length comes with a significant increase in the error rate. The main objective of current studies on long reads is to handle this high error rate, which can reach 16%.

Methods: In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the high quality of short reads with the greater length of long reads. Our hybrid algorithm uses the set of short reads to detect tandem repeat patterns based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies.

Results: MixTaR is tested on both simulated and real reads from complex organisms. To analyse its robustness to errors thoroughly, we use short and long reads with different error rates. The results are then analysed in terms of the number of tandem repeats detected and the length of their patterns.

Conclusions: Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR accurately detects tandem repeats whose pattern lengths vary over a wide range.
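The general idea of the hybrid approach can be illustrated with a small sketch. This is not MixTaR's implementation: it is a toy example that assumes error-free reads, treats every short cycle in a de Bruijn graph built from the short reads as a candidate pattern, and validates a candidate by exact substring matching against the long reads (real long reads would need approximate matching). The function names, the parameters `k`, `max_len` and `min_copies`, and the choice of a plain depth-first search for cycles are all illustrative assumptions, and the local greedy assembly step is omitted.

```python
from collections import defaultdict


def build_de_bruijn(short_reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def canonical_rotation(pattern):
    """Return the lexicographically smallest rotation of a pattern."""
    return min(pattern[i:] + pattern[:i] for i in range(len(pattern)))


def candidate_patterns(graph, max_len):
    """Collect patterns spelled by simple cycles of at most max_len nodes.

    Assumption: a tandem repeat shows up as a cycle in the graph. The
    search is a naive depth-first enumeration, fine for toy inputs only.
    """
    patterns = set()

    def dfs(start, node, path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                # Each node on the cycle contributes its last character.
                spelled = "".join(n[-1] for n in path)
                patterns.add(canonical_rotation(spelled))
            elif nxt not in path and len(path) < max_len:
                dfs(start, nxt, path + [nxt])

    for start in list(graph):
        dfs(start, start, [start])
    return patterns


def validate_with_long_reads(patterns, long_reads, min_copies=3):
    """Keep patterns seen as min_copies consecutive copies in a long read.

    Assumption: exact matching, which ignores long-read sequencing errors.
    """
    return {
        p for p in patterns
        if any(p * min_copies in read for read in long_reads)
    }


if __name__ == "__main__":
    genome = "GGATCATCATCATCTT"  # hypothetical sequence with an ATC repeat
    short_reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 2)]
    graph = build_de_bruijn(short_reads, k=4)
    candidates = candidate_patterns(graph, max_len=10)
    print(validate_with_long_reads(candidates, [genome]))
```

Splitting the work this way mirrors the division of labour described in the abstract: the short reads supply the candidate patterns, while the long reads, which can span the whole repeat, decide which candidates are kept.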
