Collection Selection with Highly Discriminative Keys

The centralized web search paradigm introduces several problems, such as the large data traffic required for crawling, the difficulty of keeping the index fresh, and the inability to index everything. In this study, we examine collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that, under the assumption that web servers index their local content, the approach outperforms a Dirichlet-smoothed language modeling approach to collection selection.
