Fast construction of the HYB index

As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, faceted search, error-tolerant search, database-style select and join, and semantic search. In this work we show that HYB can be constructed at least as fast as INV, and often up to twice as fast. This is because HYB, by its nature, requires only a half-inversion of the data and allows efficient in-place index construction instead of the traditional merge-based construction. We also pay particular attention to the cache efficiency of the in-memory posting accumulation, an issue that has not been addressed in previous work, and show that our simple multilevel posting accumulation scheme yields far fewer cache misses than related approaches. Finally, we show that HYB supports fast dynamic index updates more easily than INV.
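To make the idea of a half-inversion concrete, the following is a minimal in-memory sketch, not the construction algorithm of this work. It assumes that the sorted vocabulary is split into fixed-size ranges of word ids ("blocks") and that each block stores (document id, word id) pairs in document order. The names `build_half_inverted`, `prefix_search`, and `block_size` are hypothetical, and compression, disk layout, and the multilevel posting accumulation scheme are all omitted.

```python
from bisect import bisect_left


def build_half_inverted(docs, block_size):
    """Build a toy half-inverted index (illustrative assumption, not HYB itself).

    docs: list of token lists; a document's id is its position in the list.
    Returns the sorted vocabulary and one posting list per block of word ids.
    """
    vocab = sorted({w for tokens in docs for w in tokens})
    word_id = {w: i for i, w in enumerate(vocab)}
    num_blocks = (len(vocab) + block_size - 1) // block_size
    blocks = [[] for _ in range(num_blocks)]
    # Documents are visited in id order, so every block ends up sorted by
    # document id without a per-word sort or merge step.
    for doc_id, tokens in enumerate(docs):
        for wid in sorted({word_id[w] for w in tokens}):
            blocks[wid // block_size].append((doc_id, wid))
    return vocab, blocks


def prefix_search(vocab, blocks, block_size, prefix):
    """Return sorted ids of documents containing a word that starts with prefix."""
    # Words with the given prefix form a contiguous range [lo, hi) of word ids.
    # Assumes no word contains the code point U+10FFFF.
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\U0010ffff")
    if lo == hi:
        return []
    result = set()
    # Scan only the blocks that overlap the range and filter by word id.
    for b in range(lo // block_size, (hi - 1) // block_size + 1):
        for doc_id, wid in blocks[b]:
            if lo <= wid < hi:
                result.add(doc_id)
    return sorted(result)
```

In this sketch a prefix query touches a few contiguous blocks rather than one inverted list per matching word, which is the property that makes prefix search cheap; the block size trades scan cost against the number of blocks.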
