Paper information - Shard ranking and cutoff estimation for topically partitioned collections

Large document collections can be partitioned into 'topical shards' to facilitate distributed search. In a low-resource search environment, only a few of the shards can be searched in parallel. Such an environment faces two intertwined challenges: shard ranking, which determines which shards to consult for a given query, and cutoff estimation, which determines how many shards from that ranking to consult. In this paper we present a family of three algorithms that address both problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking, which each of our algorithms transforms into a tree structure. A bottom-up traversal of the tree is used to infer a ranking of shards and to estimate a stopping point in this ranking that yields cost-effective selective distributed search. Compared with a state-of-the-art shard ranking approach, the proposed algorithms provide substantially higher search efficiency while maintaining comparable search effectiveness.
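
The abstract gives no algorithmic detail, so the following is only a rough sketch of the general idea: turn a flat CSI document ranking into a shard ranking and a cutoff. It does not build the tree structure the paper describes. Instead, as a simplifying assumption, each sampled document votes for its shard with a weight that decays exponentially with its rank, and voting stops once a vote drops below a threshold. The function name `rank_shards`, the parameters `base` and `min_vote`, and the input layout are all hypothetical.

```python
from collections import defaultdict


def rank_shards(csi_ranking, base=2.0, min_vote=1e-4):
    """Return (shard_id, total_vote) pairs, best shard first.

    csi_ranking: list of (shard_id, score) pairs for the sampled documents
    retrieved from the CSI, sorted by descending score (assumed layout).
    base, min_vote: illustrative parameters, not values from the paper.
    """
    votes = defaultdict(float)
    for rank, (shard_id, score) in enumerate(csi_ranking):
        # The vote shrinks exponentially as we move down the ranking.
        vote = score * base ** (-rank)
        if vote < min_vote:
            # Scores are descending, so every later vote is smaller too.
            break
        votes[shard_id] += vote
    # Shards that received no vote never enter the ranking.
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


# Toy CSI result: two sampled documents from shard A, one each from B and C.
csi_result = [("A", 1.0), ("B", 0.8), ("A", 0.6), ("C", 0.5)]
shard_ranking = rank_shards(csi_result, base=2.0, min_vote=0.1)
cutoff = len(shard_ranking)  # number of shards to search
```

In this sketch the cutoff falls out of the ranking itself: only shards that collected at least one vote above the threshold are searched, so a query whose top CSI documents concentrate in a few shards yields a short shard list.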
