Paper information - Robust result merging using sample-based score estimates

Robust result merging using sample-based score estimates

In federated information retrieval, a query is routed to multiple collections and a single answer list is constructed by combining the results. Such metasearch provides a mechanism for locating documents on the hidden Web and, by use of sampling, can proceed even when the collections are uncooperative. However, the similarity scores for documents returned from different collections are not comparable, and, in uncooperative environments, document scores are unlikely to be reported. We introduce a new merging method for uncooperative environments, in which similarity scores for the sampled documents held for each collection are used to estimate global scores for the documents returned per query. This method requires no assumptions about properties such as the retrieval models used. Using experiments on a wide range of collections, we show that in many cases our merging methods are significantly more effective than previous techniques.
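The abstract does not spell out how the sampled documents' scores are turned into global score estimates, so the following is only a minimal sketch of one possible reading, not the authors' algorithm. It assumes that each collection's sample is roughly uniform, so that the i-th best sampled document sits near rank i * (collection size / sample size) in the full collection, and that score falls off linearly with the logarithm of rank. The function names (`fit_line`, `estimate_scores`, `merge`), the log-rank regression, and the requirement that collection sizes are known or estimated are all assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # A single point (or identical x values): fall back to a flat line.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_scores(sample_scores, collection_size, num_returned):
    """Estimate global scores for ranks 1..num_returned of one collection.

    sample_scores: scores that a single central retrieval model assigns,
    for the current query, to the sampled documents of this collection.
    """
    if not sample_scores:
        return [0.0] * num_returned
    ranked = sorted(sample_scores, reverse=True)
    # Assumption: a uniform sample, so sampled rank i maps to collection
    # rank i * (collection_size / sample_size).
    scale = collection_size / len(ranked)
    xs = [math.log((i + 1) * scale) for i in range(len(ranked))]
    intercept, slope = fit_line(xs, ranked)
    return [intercept + slope * math.log(r) for r in range(1, num_returned + 1)]


def merge(results, samples, sizes):
    """Merge per-collection ranked lists by estimated global score."""
    merged = []
    for name, docs in results.items():
        estimates = estimate_scores(samples[name], sizes[name], len(docs))
        merged.extend(zip(estimates, docs))
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in merged]
```

Only the order of each returned list is used here, never the scores reported by the collections themselves, which matches the uncooperative setting the abstract describes. The sketch interleaves documents from different collections purely by the estimated scores.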

Milad Shokouhi | Justin Zobel
