Lightweight LCP Construction for Next-Generation Sequencing Datasets

The advent of "next-generation" DNA sequencing (NGS) technologies has made collections of hundreds of millions of DNA sequences commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings, and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but they require large data structures to be held in RAM, which makes them infeasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and the Burrows-Wheeler transform (BWT) of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole-genome sequencing experiments.
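To make the two output arrays concrete, the following is a minimal in-memory sketch of what the BWT and LCP of a small collection look like. It is not the lightweight sequential-scan method proposed in the paper: it sorts every suffix in RAM, which is exactly the cost that method is designed to avoid. The function name `bwt_and_lcp` is hypothetical, and the sketch assumes the common convention that each sequence carries its own end marker, with the marker of sequence i ordered before that of sequence i+1 and before every letter.

```python
def bwt_and_lcp(reads):
    """Naive BWT and LCP of a string collection (illustration only).

    Assumption: each read gets a distinct end marker, shown as "$" in the
    output; markers sort before all letters and by read index.
    """
    suffixes = []
    for i, read in enumerate(reads):
        text = read + "$"
        for j in range(len(text)):
            # Sort key: letters as (1, c), then this read's marker as (0, i),
            # so markers compare lower than letters and are pairwise distinct.
            key = tuple((1, c) for c in read[j:]) + ((0, i),)
            # Symbol preceding the suffix; j == 0 wraps to the end marker.
            prev = text[j - 1]
            suffixes.append((key, prev))

    suffixes.sort(key=lambda s: s[0])

    bwt = "".join(prev for _, prev in suffixes)

    # LCP[k] = length of the common prefix of sorted suffixes k-1 and k.
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


result = bwt_and_lcp(["AC", "AG"])
# -> ('CG$$AA', [0, 0, 0, 1, 0, 0])
```

For the two reads "AC" and "AG", the sorted suffixes are $, $, AC$, AG$, C$, G$, so the only nonzero LCP value is the 1 shared by AC$ and AG$. Because the end markers are treated as distinct, they never contribute to a common prefix.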
