Scalable K-Order LCP Array Construction for Massive Data

Given an input text T of size n and its suffix array, we propose a new method for computing the K-order longest common prefix (LCP) array of T, in which the LCP of any two suffixes is truncated to at most K. The method uses a fingerprint function to reduce the comparison of two variable-length strings to a comparison of their fingerprints, which are encoded as fixed-size integers. It takes \( {\text{O}}\left( {n\,\log K} \right) \) time and \( {\text{O}}\left( n \right) \) space on both internal and external memory models. It also scales to a typical distributed model consisting of \( d \) computing nodes, where the time and space costs are divided evenly among the nodes as \( {\text{O}}\left( {n\,\log K/d} \right) \) and \( {\text{O}}\left( {n/d} \right) \) per node, respectively. For performance evaluation, an experimental study has been conducted on both the external memory and distributed models. A cluster of computers in a local area network is commonly available in practice, yet scalable LCP-array construction algorithms for such a distributed model are currently lacking. Our method is a candidate solution to meet this demand.
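The fingerprint idea can be illustrated with a minimal in-memory sketch. It is not the paper's external-memory or distributed algorithm, only one plausible reading of the abstract: a Karp-Rabin style polynomial fingerprint over prefixes of T, combined with a binary search on the match length in [0, K] for each pair of suffixes adjacent in the suffix array, which gives O(log K) fingerprint comparisons per pair. The names `prefix_fingerprints`, `substring_fp`, and `k_order_lcp`, as well as the constants `MOD` and `BASE`, are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed parameters (not from the paper): a Mersenne prime modulus and a
# fixed base. A careful implementation would pick the base at random to
# bound the collision probability.
MOD = (1 << 61) - 1
BASE = 257


def prefix_fingerprints(text: bytes) -> Tuple[List[int], List[int]]:
    """Return (fp, pw) where fp[i] fingerprints text[:i] and pw[i] = BASE**i mod MOD."""
    n = len(text)
    fp = [0] * (n + 1)
    pw = [1] * (n + 1)
    for i, c in enumerate(text):
        fp[i + 1] = (fp[i] * BASE + c + 1) % MOD  # +1 so a zero byte still counts
        pw[i + 1] = (pw[i] * BASE) % MOD
    return fp, pw


def substring_fp(fp: List[int], pw: List[int], start: int, length: int) -> int:
    """Fingerprint of text[start:start+length], computed in O(1) from prefix values."""
    return (fp[start + length] - fp[start] * pw[length]) % MOD


def k_order_lcp(text: bytes, sa: List[int], k: int) -> List[int]:
    """K-order LCP array: lcp[r] = min(K, LCP of suffixes sa[r-1] and sa[r]); lcp[0] = 0.

    Equal fingerprints are treated as equal strings, so the result is exact
    only when no fingerprint collision occurs.
    """
    n = len(text)
    fp, pw = prefix_fingerprints(text)
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        lo, hi = 0, min(k, n - a, n - b)
        # Binary search for the largest length whose fingerprints agree.
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if substring_fp(fp, pw, a, mid) == substring_fp(fp, pw, b, mid):
                lo = mid
            else:
                hi = mid - 1
        lcp[r] = lo
    return lcp
```

For example, with `text = b"banana"` and its suffix array `[5, 3, 1, 0, 4, 2]`, calling `k_order_lcp(text, sa, 2)` caps every entry at 2. Because each pair is processed independently once the prefix fingerprints are available, the loop over `r` is the part one would expect to partition across nodes in a distributed setting, although how the paper actually does this is not stated in the abstract.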
