Shard ranking and cutoff estimation for topically partitioned collections

Large document collections can be partitioned into 'topical shards' to facilitate distributed search. In a low-resource search environment, only a few of these shards can be searched in parallel. Such an environment faces two intertwined challenges: shard ranking, that is, determining which shards to consult for a given query, and cutoff estimation, that is, deciding how many shards from the ranking to consult. In this paper we present a family of three algorithms that address both problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking, which each of our algorithms transforms into a tree structure. A bottom-up traversal of the tree is then used to infer a ranking of shards and to estimate a stopping point in this ranking that yields cost-effective selective distributed search. Compared with a state-of-the-art shard ranking approach, the proposed algorithms provide substantially higher search efficiency with comparable search effectiveness.
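To make the two tasks concrete, the following is a minimal Python sketch of how a flat CSI document ranking can be turned into a shard ranking and a cutoff. It is not one of the three tree-based algorithms presented in the paper, whose details are not given in this abstract. It substitutes a simple rank-discounted voting scheme for shard ranking and a cumulative vote-share threshold for cutoff estimation. The function names, the `doc_to_shard` mapping, and the `coverage` parameter are all assumptions introduced for illustration.

```python
from collections import defaultdict


def rank_shards(csi_ranking, doc_to_shard):
    """Rank shards by rank-discounted votes from a CSI result list.

    csi_ranking  -- document ids returned by the CSI, best first (assumed input)
    doc_to_shard -- maps each sampled document id to its shard (assumed input)
    """
    votes = defaultdict(float)
    for rank, doc_id in enumerate(csi_ranking, start=1):
        # A document at rank r contributes 1/r to the shard it was sampled from.
        votes[doc_to_shard[doc_id]] += 1.0 / rank
    # Highest vote first; ties are broken by shard name for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


def estimate_cutoff(ranked_shards, coverage=0.8):
    """Return how many top shards are needed to cover `coverage` of all votes.

    `coverage` is a hypothetical tuning knob, not a parameter from the paper.
    """
    total = sum(score for _, score in ranked_shards)
    if total == 0:
        return 0
    running = 0.0
    for k, (_, score) in enumerate(ranked_shards, start=1):
        running += score
        if running / total >= coverage:
            return k
    return len(ranked_shards)


if __name__ == "__main__":
    csi_ranking = ["d1", "d2", "d3", "d4"]
    doc_to_shard = {"d1": "A", "d2": "A", "d3": "B", "d4": "C"}
    ranked = rank_shards(csi_ranking, doc_to_shard)
    print([shard for shard, _ in ranked], estimate_cutoff(ranked))
```

Lowering `coverage` makes the search more selective (fewer shards, lower cost), while raising it trades efficiency for a better chance of reaching all relevant shards. This is the same efficiency versus effectiveness trade-off that the cutoff estimation problem targets.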
