Phrase Query Optimization on Inverted Indexes

Phrase queries are a key feature of modern search engines. Beyond that, they increasingly serve as a building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, because positional information has to be kept in the index and all words, including stopwords, need to be considered. We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words, and we study how arbitrary phrase queries can be processed efficiently on such an index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm for solving it. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods.
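To make the optimization problem concrete, the following is a minimal sketch under stated assumptions, not the algorithm from the paper. It assumes that a query plan is a set of indexed word sequences that together cover every position of the query phrase (overlaps allowed), and that the cost of a plan is the sum of the posting-list sizes it reads. The function names `occurrences` and `cheapest_cover`, the `list_sizes` dictionary, and all numbers are hypothetical. The search is exhaustive, so it is exponential in the number of candidate sequences.

```python
from itertools import combinations


def occurrences(query, phrase):
    """Return the start offsets at which `phrase` occurs inside `query`."""
    n, m = len(query), len(phrase)
    return [i for i in range(n - m + 1) if query[i:i + m] == phrase]


def cheapest_cover(query, list_sizes):
    """Exhaustively search for the cheapest set of indexed sequences
    that covers every position of the query phrase.

    Assumed cost model (not taken from the paper): the cost of a plan
    is the sum of the posting-list sizes of the selected sequences.
    Returns (cost, [sequences]) or None if no cover exists.
    """
    query = tuple(query)

    # Keep only indexed sequences that occur in the query, together
    # with the set of query positions each of them covers.
    candidates = []
    for phrase, size in list_sizes.items():
        covered = set()
        for start in occurrences(query, phrase):
            covered.update(range(start, start + len(phrase)))
        if covered:
            candidates.append((phrase, size, covered))

    all_positions = set(range(len(query)))
    best = None
    # Try every non-empty subset of candidates: exponential time.
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            covered = set().union(*(c[2] for c in subset))
            if covered != all_positions:
                continue
            cost = sum(c[1] for c in subset)
            if best is None or cost < best[0]:
                best = (cost, [c[0] for c in subset])
    return best


# Hypothetical posting-list sizes for single words and two-word sequences.
list_sizes = {
    ("we",): 1000,
    ("are",): 2000,
    ("the",): 5000,
    ("champions",): 50,
    ("we", "are"): 300,
    ("are", "the"): 800,
    ("the", "champions"): 20,
}

plan = cheapest_cover(["we", "are", "the", "champions"], list_sizes)
print(plan)
```

With these made-up sizes, the two-word entries for "we are" and "the champions" cover the whole query while reading far fewer postings than the four single-word lists would, which illustrates why the choice of sequences matters.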
