Self-indexing Natural Language

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. Self-indexes represent a string in space close to its compressed size and support indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and separating searchable content from non-searchable presentation aspects of the text. Our results show that the self-index requires space very close to that of the best word-based compressors, and that it achieves better search times than inverted indexes (using the same overall space) when searching for phrases.
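To make the idea of treating text as a string of words concrete, the following is a minimal sketch, not the paper's method. It maps each word to an integer identifier, builds a plain (uncompressed) suffix array over the resulting sequence, and locates a phrase by binary search. A real self-index would replace both the sequence and the suffix array with a compressed representation. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) and the tokenization rule (lowercased alphanumeric runs, with separators discarded as non-searchable presentation) are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: searchable content = lowercased alphanumeric runs;
    # punctuation and spacing are treated as presentation and dropped.
    return re.findall(r"\w+", text.lower())


def build_word_suffix_array(text: str) -> Tuple[List[int], Dict[str, int], List[int]]:
    words = tokenize(text)
    # The "alphabet" is the vocabulary: one integer symbol per distinct word.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a sketch; too slow and too large for real collections.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return ids, vocab, sa


def find_phrase(phrase: str, ids: List[int], vocab: Dict[str, int],
                sa: List[int]) -> List[int]:
    pat_words = tokenize(phrase)
    if not pat_words or any(w not in vocab for w in pat_words):
        return []
    pat = [vocab[w] for w in pat_words]
    m = len(pat)

    # Lower bound: first suffix whose first m symbols are >= pat.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m symbols are > pat.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid

    # Word positions (0-based) where the phrase starts, in text order.
    return sorted(sa[start:lo])


ids, vocab, sa = build_word_suffix_array("To be, or not to be: that is the question.")
print(find_phrase("to be", ids, vocab, sa))
```

In this sketch every occurrence of a phrase occupies a contiguous range of the suffix array, so a phrase of any length costs two binary searches, whereas an inverted index would intersect one posting list per word. The sketch says nothing about the space side of the comparison, which depends entirely on the compressed representation.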
