Word-based self-indexes for natural language text

The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space on top of the compressed text, and that space can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index, one that replaces the text while using essentially the same space required by the compressed text alone (a compression ratio of around 35%). Within this space it supports not only decompression of arbitrary passages, but also efficient word and phrase searches. Phrase searches are orders of magnitude faster than those over inverted indexes, and single-word searches are still faster when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases, which is useful for computing their relevance. To build them, we adapt self-indexes that succeeded in indexing arbitrary strings within compressed space so that they can deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, which yields word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, stopword removal, and similar filters, as is usual with inverted indexes.
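
To make the idea of treating text as a sequence of words concrete, the following is a minimal sketch, not the compressed self-index the article describes. It assumes a plain, uncompressed suffix array built over word identifiers, with case folding applied before indexing. All names (`tokenize`, `build_word_index`, `count_phrase`, `locate_phrase`) are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    # Case folding happens here, so the searchable sequence is
    # independent of how the text was originally presented.
    return re.findall(r"\w+", text.lower())


def build_word_index(text):
    words = tokenize(text)
    # Map each distinct word to an integer id; the "alphabet" is the vocabulary.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    seq = [vocab[w] for w in words]
    # Naive suffix array over word ids (fine for a small example only).
    sa = sorted(range(len(seq)), key=lambda i: seq[i:])
    return vocab, seq, sa


def _bound(seq, sa, pat, upper):
    # Binary search over suffixes, comparing only their first len(pat) words.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        window = seq[sa[mid]:sa[mid] + len(pat)]
        if window < pat or (upper and window == pat):
            lo = mid + 1
        else:
            hi = mid
    return lo


def _range(index, phrase):
    vocab, seq, sa = index
    words = tokenize(phrase)
    if not words or any(w not in vocab for w in words):
        return 0, 0
    pat = [vocab[w] for w in words]
    return _bound(seq, sa, pat, False), _bound(seq, sa, pat, True)


def count_phrase(index, phrase):
    # Counting needs only the two range boundaries, never the positions.
    start, end = _range(index, phrase)
    return end - start


def locate_phrase(index, phrase):
    # Word offsets (0-based) at which the phrase starts.
    start, end = _range(index, phrase)
    return sorted(index[2][start:end])


index = build_word_index("To be or not to be that is the question to be")
print(count_phrase(index, "to be"))
print(locate_phrase(index, "to be"))
```

In this sketch a phrase is located with two binary searches over one contiguous range of suffixes, with no list intersection, and counting needs only the size of that range. This is a simplified illustration of why counting can be especially cheap on suffix-array-like structures; the space usage here is far from the compressed setting discussed above.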
