Paper information - Evaluating sampling methods for uncooperative collections

Evaluating sampling methods for uncooperative collections

Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters. We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.

David Hawking | Paul Thomas
