Topic-based Index Partitions for Efficient and Effective Selective Search

Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, every index shard is searched for each query. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search quality. Three types of allocation policies are studied: random, source-based, and topic-based. K-means clustering is used to create the topic-based shards. To manage the computational cost of clustering large datasets, topics are defined on a subset of the collection. Experiments with three large collections demonstrate that selective search using topic-based shards reduces search costs by at least an order of magnitude without reducing search accuracy.
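The topic-based allocation described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: documents are represented as unit-length term-frequency vectors, similarity is cosine similarity, the remaining documents are assigned to the nearest topic centroid learned from the sample, and shards are ranked for a query by centroid similarity. The function names (`build_shards`, `select_shards`, and the helpers) are hypothetical, and the shard-ranking step stands in for whatever resource selection method a real system would use.

```python
import math
import random
from collections import Counter


def vectorize(text):
    """Turn text into a unit-length sparse term-frequency vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both are unit length)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Sum the vectors and rescale the result to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def nearest(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def learn_topics(sample_vectors, k, iterations=10, seed=0):
    """Plain K-means on the sampled documents only (assumed similarity: cosine)."""
    rng = random.Random(seed)
    centroids = rng.sample(sample_vectors, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in sample_vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the previous centroid if a cluster received no documents.
        centroids = [centroid(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids


def build_shards(docs, k, sample_size, seed=0):
    """Define topics on a sample, then allocate every document to one shard."""
    vectors = [vectorize(d) for d in docs]
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    centroids = learn_topics(sample, k, seed=seed)
    shards = [[] for _ in range(k)]
    for doc_id, v in enumerate(vectors):
        shards[nearest(v, centroids)].append(doc_id)
    return centroids, shards


def select_shards(query, centroids, n):
    """Rank shards by centroid similarity to the query and keep the top n."""
    q = vectorize(query)
    ranked = sorted(range(len(centroids)),
                    key=lambda i: cosine(q, centroids[i]), reverse=True)
    return ranked[:n]
```

In this sketch, a query would be evaluated only against the shards returned by `select_shards`, rather than against all `k` shards. Clustering only the sample keeps the K-means cost independent of the full collection size, while the final assignment pass over all documents is a single nearest-centroid lookup per document.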
