Paper information - Estimating corpus size via queries

Estimating corpus size via queries

We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem, and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
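
The abstract does not spell out the estimator, so the following Python sketch is only an illustration under stated assumptions. It assumes one natural weighting scheme: for a query drawn uniformly from the pool, each returned document contributes the reciprocal of the number of pool queries it matches, and the mean contribution is scaled by the pool size. For the second approach it assumes a capture-recapture style combination, |D| ≈ |D_A| · |D_B| / |D_A ∩ D_B|, which is reasonable only when the two pools are nearly uncorrelated. The toy corpus and the names `search`, `estimate_size`, `matches_pool`, and `two_pool_corpus_estimate` are hypothetical and not taken from the paper.

```python
import random

# Toy corpus (hypothetical): document id -> set of terms it contains.
CORPUS = {
    0: {"a1", "b1"},
    1: {"a1", "a2"},
    2: {"a2", "b2"},
    3: {"b1"},
    4: {"a3", "b1", "b2"},
    5: {"c"},
    6: {"a1", "a3", "b2"},
    7: {"b2"},
}
POOL_A = ["a1", "a2", "a3"]  # first query pool (assumed)
POOL_B = ["b1", "b2"]        # second query pool (assumed)


def search(corpus, term):
    """Stand-in for a query interface: ids of documents containing the term."""
    return [doc_id for doc_id, terms in corpus.items() if term in terms]


def estimate_size(corpus, pool, sampled_queries, condition=lambda terms: True):
    """Estimate how many documents match at least one pool query and satisfy
    `condition`. Every query in `sampled_queries` must come from `pool`."""
    pool_set = set(pool)
    total = 0.0
    for query in sampled_queries:
        for doc_id in search(corpus, query):
            terms = corpus[doc_id]
            if condition(terms):
                # Number of pool queries this document matches (at least 1).
                degree = len(pool_set & terms)
                total += 1.0 / degree
    # Scale the mean per-query contribution by the pool size.
    return len(pool) * total / len(sampled_queries)


def matches_pool(pool):
    """Predicate: does a document match at least one query in `pool`?"""
    pool_set = set(pool)
    return lambda terms: bool(pool_set & terms)


def two_pool_corpus_estimate(corpus, pool_a, pool_b, num_samples, rng):
    """Assumed combination for the second approach (capture-recapture style)."""
    sample_a = [rng.choice(pool_a) for _ in range(num_samples)]
    sample_b = [rng.choice(pool_b) for _ in range(num_samples)]
    size_a = estimate_size(corpus, pool_a, sample_a)
    size_b = estimate_size(corpus, pool_b, sample_b)
    size_ab = estimate_size(corpus, pool_a, sample_a, matches_pool(pool_b))
    if size_ab == 0:
        raise ValueError("no sampled document matched both pools")
    return size_a * size_b / size_ab


if __name__ == "__main__":
    rng = random.Random(0)
    print(two_pool_corpus_estimate(CORPUS, POOL_A, POOL_B, 200, rng))
```

Passing the whole pool as `sampled_queries` removes the sampling noise, which makes it easy to check on the toy corpus that the weighted sum recovers the number of covered documents. The first approach would instead divide the estimated |D_A| by the fraction of a uniform document sample that matches pool A.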
