Paper information - Distributing Efficiently the Block-Max WAND Algorithm

Distributing Efficiently the Block-Max WAND Algorithm

Abstract Large search engines are complex systems composed of several services. Each service consists of a set of distributed processing nodes dedicated to executing a single operation required to answer user queries. One of these services is in charge of computing the top-k document results for queries by means of a document ranking operation. This ranking service is a major bottleneck in efficient query processing, as billions of documents have to be processed each day. To answer user queries within a fraction of a second, techniques such as the Block-Max WAND algorithm are used to avoid fully processing all documents related to a query. In this work, we propose to efficiently distribute the Block-Max WAND computation among the processing nodes of the ranking service. Our proposal is devised to reduce memory usage and computation cost by assuming that each one of the P ranking processing nodes provides the top-(K/P + α) document results, where α is an estimation parameter that is dynamically set for each query. The experimental results show that the proposed approach significantly reduces execution time compared with current approaches used in search engines.
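
The per-node result budget described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it skips Block-Max WAND scoring entirely and assumes each node already holds final document scores. The names local_top and distributed_top_k are hypothetical, α is passed in as a fixed integer rather than estimated per query as the abstract describes, K/P is rounded up, and the final merge at a broker is an assumed design.

```python
import heapq
import math
from typing import Dict, List, Tuple

Scored = Tuple[float, int]  # (score, doc_id)


def local_top(scores: Dict[int, float], n: int) -> List[Scored]:
    """Return the n highest-scoring (score, doc_id) pairs held by one node."""
    return heapq.nlargest(n, ((s, d) for d, s in scores.items()))


def distributed_top_k(partitions: List[Dict[int, float]],
                      k: int, alpha: int) -> List[Scored]:
    """Merge per-node top-(ceil(K/P) + alpha) lists into one top-K list.

    Assumption: alpha is a caller-supplied constant; the paper sets it
    dynamically for each query.
    """
    p = len(partitions)
    per_node = math.ceil(k / p) + alpha  # budget per node instead of K
    candidates: List[Scored] = []
    for part in partitions:
        candidates.extend(local_top(part, per_node))
    return heapq.nlargest(k, candidates)


if __name__ == "__main__":
    # Hypothetical scores: node 0 holds three of the four best documents.
    partitions = [
        {1: 9.0, 2: 8.0, 3: 7.0, 4: 1.0},
        {5: 6.0, 6: 2.0},
    ]
    print(distributed_top_k(partitions, k=4, alpha=0))
    print(distributed_top_k(partitions, k=4, alpha=1))
```

In this toy data, alpha=0 lets each node return only two documents, so document 3 is missed even though it belongs to the true top 4. With alpha=1 the merged list matches the exact result. This shows why the slack term matters when the best documents are unevenly spread across nodes.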

Mauricio Marín | Veronica Gil Costa | Oscar Rojas
