Linear work suffix array construction

Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency, while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
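
The abstract names the difference cover as the central concept but does not define it. Under the standard definition, a set D of residues is a difference cover modulo v if every residue 0..v-1 can be written as (a - b) mod v with a and b in D. The set {1, 2} is a difference cover modulo 3, which is where the "3" in DC3 comes from. The following is a minimal sketch of that definition, not code from the 50-line implementation; the function names `is_difference_cover` and `common_shift` are hypothetical and chosen here for illustration only.

```cpp
#include <vector>

// Returns true if D is a difference cover modulo v, i.e. every residue
// 0..v-1 equals (a - b) mod v for some a, b in D.
// (Illustrative helper; the name is an assumption, not from the paper.)
bool is_difference_cover(const std::vector<int>& D, int v) {
    if (v <= 0) return false;
    std::vector<bool> covered(v, false);
    for (int a : D)
        for (int b : D)
            covered[((a - b) % v + v) % v] = true;  // keep the residue non-negative
    for (bool c : covered)
        if (!c) return false;
    return true;
}

// Returns the smallest shift l in [0, v) such that both (i + l) mod v and
// (j + l) mod v lie in D, or -1 if no such shift exists. For a difference
// cover a shift always exists, so two suffixes starting at i and j can be
// compared by looking at no more than v characters and then at two sample
// positions. (Illustrative helper; assumes i, j >= 0.)
int common_shift(const std::vector<int>& D, int v, int i, int j) {
    if (v <= 0) return -1;
    std::vector<bool> in_D(v, false);
    for (int d : D)
        in_D[((d % v) + v) % v] = true;
    for (int l = 0; l < v; ++l)
        if (in_D[(i + l) % v] && in_D[(j + l) % v])
            return l;
    return -1;
}
```

For example, `is_difference_cover({1, 2}, 3)` holds, and `common_shift({1, 2}, 3, 0, 2)` yields 2, since positions 0 + 2 and 2 + 2 have residues 2 and 1 modulo 3, both in the cover. A larger v with a correspondingly small cover is what allows the generalized DC algorithm to trade running time for working space.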
