Alphabet-Independent Compressed Text Indexing

Self-indexes can represent a text in space that asymptotically matches the information-theoretic lower bound under the kth-order entropy model, while supporting access to any text substring and indexed pattern searches. Their time complexities are not optimal, however: in particular, they are always multiplied by a factor that depends on the alphabet size. In this article, we achieve, for the first time, full alphabet independence in the time complexities of self-indexes while retaining space optimality. We also obtain some relevant byproducts.
