Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism; just 8 threads are sufficient to make the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms by making them in-place: the input (text and suffix array) is treated as read-only, and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets; all previous implementations assume the byte alphabet.

1998 ACM Subject Classification: E.1 Data Structures, F.2.2 Nonnumerical Algorithms and Problems
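
For readers unfamiliar with the two arrays named in the abstract, the following is a minimal in-memory sketch of what they contain. It is not one of the external-memory algorithms engineered in the paper: the suffix array is built by naive sorting, and the LCP array is computed with the classic linear-time scan over text positions. The function names `suffix_array` and `lcp_array` are illustrative assumptions, not identifiers from the authors' implementation.

```python
def suffix_array(text):
    """Return the starting positions of all suffixes in lexicographic order.

    Naive construction by sorting suffix slices; suitable for tiny inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Return lcp, where lcp[r] is the length of the longest common prefix
    of the suffixes at sa[r - 1] and sa[r], and lcp[0] is 0.

    Processes suffixes in text order and reuses the previous match length,
    which is what makes the scan linear-time.
    """
    n = len(text)
    rank = [0] * n  # rank[p] = position of suffix p within sa
    for r, pos in enumerate(sa):
        rank[pos] = r

    lcp = [0] * n
    h = 0  # number of symbols already known to match
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 symbols
        else:
            h = 0
    return lcp


text = "banana"
sa = suffix_array(text)
print(sa)                   # [5, 3, 1, 0, 4, 2]
print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```

Because the sketch only compares symbols for equality and order, it also accepts a list of integers in place of a string, which loosely mirrors the large-alphabet setting mentioned above. It holds everything in RAM and does not address the I/O, disk-space, or parallelism concerns that are the subject of the paper.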
