Indexing and Compressing Text

Information retrieval is the computational discipline concerned with the efficient representation and organization of, and access to, information objects that represent natural language texts (Salton & McGill, 1983; Witten, Moffat, & Bell, 1999; Baeza-Yates & Ribeiro-Neto, 1999). A crucial subproblem in information retrieval is the design and implementation of efficient data structures and algorithms for indexing and searching information objects that are vaguely described. In this article, we present the latest developments in indexing, with special emphasis on data structures and algorithmic techniques for string manipulation, space-efficient implementations, and compression techniques for the efficient storage of information objects. These problems arise in a range of applications, such as digital libraries, molecular sequence databases (DNA sequences and protein databases; Gusfield, 1997), Web search engines, Web mining, and information filtering.
