Readjoiner: a fast and memory efficient string graph-based sequence assembler

Background: Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into k-mers, but requiring fast algorithms for the computation of suffix-prefix matches among all pairs of reads.

Results: Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only.

Conclusions: Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space efficient. Readjoiner is available at http://www.zbh.uni-hamburg.de/readjoiner.
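To make the terms "suffix-prefix match" and "transitive edge" concrete, the following is a minimal sketch on three toy reads. It is not Readjoiner's method: it compares all pairs of reads naively in quadratic time instead of using suffix sorting and scanning, and it removes transitive edges only after the full overlap graph has been built. It also ignores reverse complements, contained reads, and sequencing errors. All function names and the `min_len` parameter are hypothetical and chosen for illustration only.

```python
from itertools import permutations


def longest_suffix_prefix(a, b, min_len):
    """Length of the longest proper suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such match of length >= min_len exists.
    """
    max_len = min(len(a), len(b)) - 1
    for length in range(max_len, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_len):
    """Map each ordered read pair (i, j) to its longest suffix-prefix match."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        length = longest_suffix_prefix(reads[i], reads[j], min_len)
        if length:
            edges[(i, j)] = length
    return edges


def irreducible_edges(reads, edges):
    """Keep only edges that are not implied by a two-edge path.

    Edge (i, k) is treated as transitive if some read j has edges (i, j)
    and (j, k) whose offsets add up to exactly the offset of (i, k).
    """
    kept = {}
    for (i, k), length in edges.items():
        transitive = any(
            (i, j) in edges
            and (j, k) in edges
            and edges[(i, j)] + edges[(j, k)] - len(reads[j]) == length
            for j in range(len(reads))
        )
        if not transitive:
            kept[(i, k)] = length
    return kept


# Three reads sampled from the toy sequence ACGTACGATT.
reads = ["ACGTAC", "GTACGA", "ACGATT"]
all_edges = overlap_graph(reads, min_len=2)
string_graph = irreducible_edges(reads, all_edges)
print(all_edges)     # {(0, 1): 4, (0, 2): 2, (1, 2): 4}
print(string_graph)  # {(0, 1): 4, (1, 2): 4}
```

In this example the edge from read 0 to read 2 (match length 2) is implied by the path through read 1, so only the two irreducible edges remain. Under the assumptions above, this is the kind of edge that the abstract says is recognized and eliminated early rather than stored.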
