Improving the performance of pipelined query processing with skipping—and its comparison to document-wise partitioning

Web search engines need to provide high throughput and short query latency. Recent results show that pipelined query processing over a term-wise partitioned inverted index may have superior throughput. However, query processing latency and scalability with respect to collection size are the main challenges associated with this method. In this paper, we evaluate the effect of inverted index skipping on the performance of pipelined query processing. Further, we introduce a novel idea of using Max-Score pruning within pipelined query processing and a new term assignment heuristic, partitioning by Max-Score. Our current results indicate a significant improvement over the state-of-the-art approach and lead to several further optimizations, including dynamic load balancing, intra-query concurrent processing, and a hybrid combination of pipelined and non-pipelined execution. Lastly, we show how the current state of term-wise partitioning compares with industry-standard document-wise partitioning. Even though there are situations in which pipelined query processing is advantageous, document-wise partitioning is still the road to follow.
