Distributing Efficiently the Block-Max WAND Algorithm

Abstract: Large search engines are complex systems composed of several services. Each service consists of a set of distributed processing nodes dedicated to executing a single operation required to answer user queries. One of these services is in charge of computing the top-K document results for queries by means of a document ranking operation. This ranking service is a major bottleneck in efficient query processing, as billions of documents have to be processed each day. To answer user queries within a fraction of a second, techniques such as the Block-Max WAND algorithm are used to avoid fully processing all documents related to a query. In this work, we propose to efficiently distribute the Block-Max WAND computation among the processing nodes of the ranking service. Our proposal is devised to reduce memory usage and computation cost by assuming that each of the P ranking processing nodes provides its top K/P + α document results, where α is an estimation parameter that is set dynamically for each query. The experimental results show that the proposed approach significantly reduces execution time compared with current approaches used in search engines.
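The K/P + α idea can be illustrated with a minimal sketch. It is not the authors' implementation: plain dictionaries of precomputed scores stand in for the per-node Block-Max WAND evaluation, the names `local_top` and `distributed_top_k` are hypothetical, K/P is rounded up as an assumption, and α is passed in as a fixed value because the abstract does not say how it is estimated.

```python
import heapq
import math


def local_top(scores, n):
    """Return the n highest-scoring (score, doc_id) pairs held by one node.

    In a real system this step would be the node's Block-Max WAND run;
    here the scores are assumed to be already computed.
    """
    return heapq.nlargest(n, ((s, d) for d, s in scores.items()))


def distributed_top_k(partitions, k, alpha):
    """Merge per-node partial results into a global top-k list.

    partitions: list of dicts mapping doc_id -> score, one per node.
    alpha: extra results requested from each node (assumed given).
    """
    p = len(partitions)
    per_node = math.ceil(k / p) + alpha  # rounding up is an assumption
    candidates = []
    for scores in partitions:
        candidates.extend(local_top(scores, per_node))
    return heapq.nlargest(k, candidates)


# Two nodes with skewed score distributions (made-up data).
partitions = [{1: 0.9, 2: 0.8, 3: 0.7}, {4: 0.6, 5: 0.1}]
exact = distributed_top_k(partitions, k=2, alpha=1)
lossy = distributed_top_k(partitions, k=2, alpha=0)
```

With `alpha=0` each node returns a single document, so document 2 is lost even though it belongs in the global top 2. With `alpha=1` the merge recovers the exact result. This is the trade-off the parameter controls: a larger α costs more work per node, while a smaller one risks missing results when the best documents are concentrated on a few nodes.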
