Robust result merging using sample-based score estimates

In federated information retrieval, a query is routed to multiple collections and a single answer list is constructed by combining the results. Such metasearch provides a mechanism for locating documents on the hidden Web and, by use of sampling, can proceed even when the collections are uncooperative. However, the similarity scores for documents returned from different collections are not comparable, and, in uncooperative environments, document scores are unlikely to be reported. We introduce a new merging method for uncooperative environments, in which similarity scores for the sampled documents held for each collection are used to estimate global scores for the documents returned per query. This method requires no assumptions about properties such as the retrieval models used. Using experiments on a wide range of collections, we show that in many cases our merging methods are significantly more effective than previous techniques.
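As a rough illustration of the idea of estimating global scores from sampled documents, the sketch below shows one way a broker might do it. It is not the method of the paper, and everything in it is an assumption made for illustration: the function names (`fit_log_rank`, `estimate_scores`, `merge`), the log-linear relationship between rank and score, the premise that sampled documents are spread evenly through each collection's ranking, and the availability of an estimated collection size. The broker scores each collection's sampled documents for the query, maps their sample ranks to approximate collection ranks, fits a curve, and reads off an estimated score for each returned rank.

```python
import math


def fit_log_rank(points):
    """Least-squares fit of score = a + b * ln(rank) over (rank, score) pairs."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Only one distinct rank: fall back to a flat line at the mean score.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def estimate_scores(sample_scores, collection_size, returned_docs):
    """Estimate a broker-side score for each document a collection returned.

    sample_scores: scores the broker computed for this collection's sampled
        documents for the current query (assumed input).
    collection_size: estimated number of documents in the collection (assumed).
    returned_docs: document identifiers in the order the collection returned them.
    """
    ranked = sorted(sample_scores, reverse=True)
    # Assumption: the sample is spread evenly through the collection's ranking,
    # so the i-th best sampled document sits near rank i * (size / sample size).
    scale = collection_size / len(ranked)
    points = [(max(1.0, (i + 1) * scale), s) for i, s in enumerate(ranked)]
    a, b = fit_log_rank(points)
    return [(doc, a + b * math.log(rank))
            for rank, doc in enumerate(returned_docs, start=1)]


def merge(per_collection_estimates):
    """Interleave all collections' results by estimated score, best first."""
    merged = [pair for estimates in per_collection_estimates for pair in estimates]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy numbers, invented purely to exercise the functions.
    coll_a = estimate_scores([3.0, 1.0], 2, ["a1", "a2"])
    coll_b = estimate_scores([2.5, 2.0], 2, ["b1", "b2"])
    for doc, score in merge([coll_a, coll_b]):
        print(doc, round(score, 3))
```

Because the estimates depend only on broker-side scores of sampled documents and on the order in which each collection returns its results, a sketch like this needs neither reported document scores nor knowledge of the retrieval models the collections use.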
