Compressed Suffix Arrays for Massive Data

We present a fast, space-efficient algorithm for constructing compressed suffix arrays (CSA). The algorithm requires O(n log n) time in the worst case and only O(n) bits of extra space in addition to the CSA. As the basic step, we describe an algorithm for merging two CSAs. We show that the construction algorithm can be parallelized in a symmetric multiprocessor system, and discuss the possibility of a distributed implementation. We also describe a parallel implementation of the algorithm, capable of indexing several gigabytes per hour.
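
To make the merging step concrete, the following is a minimal sketch of what merging two suffix arrays means. It is an illustration only, not the algorithm of the paper: it works on plain, uncompressed suffix arrays, compares suffixes directly, and makes no attempt to meet the time or space bounds stated above. The function names `suffix_array` and `merge_suffix_arrays`, and the rule that ties between equal suffixes go to the first text, are assumptions made for this example.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def merge_suffix_arrays(a, sa_a, b, sa_b):
    """Merge the suffix arrays of texts a and b into one sorted list.

    Each entry is (text_id, position), with text_id 0 for a and 1 for b.
    Suffixes are compared directly as strings (illustrative, not efficient).
    """
    merged = []
    i = j = 0
    while i < len(sa_a) and j < len(sa_b):
        # Assumption: when two suffixes are equal, the one from text a comes first.
        if a[sa_a[i]:] <= b[sa_b[j]:]:
            merged.append((0, sa_a[i]))
            i += 1
        else:
            merged.append((1, sa_b[j]))
            j += 1
    # One input is exhausted; the rest of the other is already in order.
    merged.extend((0, p) for p in sa_a[i:])
    merged.extend((1, p) for p in sa_b[j:])
    return merged


a, b = "banana", "bandana"
merged = merge_suffix_arrays(a, suffix_array(a), b, suffix_array(b))
print(merged)
```

The result lists every suffix of both texts in lexicographic order, which is the ordering a combined index over the two texts would need. A space-efficient method would avoid materializing the suffixes and the uncompressed arrays used here.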
