Out of the Box Phrase Indexing

We present a method for optimizing phrase search on inverted indexes. Our approach adds selected two-term phrases to an existing index. Whereas competing approaches often rely on the analysis of query logs, ours works out of the box and uses only the information contained in the index. It is competitive in query performance, can improve on other approaches for difficult queries, and gives performance guarantees for arbitrary queries. We further propose using a phrase index as a substitute for the positional index of an in-memory search engine that works with short documents. We support our conclusions with experiments on a high-performance main-memory search engine, and we give evidence that classical disk-based systems can also profit from our approach.
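The general idea of adding two-term phrases to an inverted index, using only information already in the index, can be illustrated with a minimal Python sketch. The selection rule below (index a phrase when the shorter of the two posting lists has at least `min_cost` entries) is an assumption made for illustration and is not taken from the paper. The function names, the `min_cost` parameter, and the fallback that verifies adjacency by scanning the document text (plausible only for short in-memory documents) are likewise hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def add_phrases(index, docs, min_cost):
    """Build posting lists for selected two-term phrases.

    Assumed selection rule (not from the paper): a phrase is indexed when
    the shorter posting list of its two terms has at least min_cost entries,
    i.e. when answering it from single-term lists would be expensive.
    """
    phrases = defaultdict(set)
    for doc_id, text in enumerate(docs):
        terms = text.lower().split()
        for a, b in zip(terms, terms[1:]):
            if min(len(index[a]), len(index[b])) >= min_cost:
                phrases[(a, b)].add(doc_id)
    return {phrase: sorted(ids) for phrase, ids in phrases.items()}


def phrase_query(index, phrases, docs, a, b):
    """Return the sorted IDs of documents containing the phrase 'a b'."""
    if (a, b) in phrases:
        # Indexed phrase: the answer is a single posting-list lookup.
        return phrases[(a, b)]
    # Fallback: intersect the term lists, then verify adjacency in the text.
    candidates = set(index.get(a, [])) & set(index.get(b, []))
    result = []
    for doc_id in sorted(candidates):
        terms = docs[doc_id].lower().split()
        if any(x == a and y == b for x, y in zip(terms, terms[1:])):
            result.append(doc_id)
    return result
```

In this sketch, phrases made of frequent terms are answered by a direct lookup, while phrases involving a rare term fall back to a short intersection followed by verification. No query log is consulted at any point.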
