Random sampling from a search engine's index

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis-Hastings algorithm, and the Maximum Degree method. The limited access to search engines forces our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
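As a rough illustration of how biased samples and their weights can be combined to simulate uniform samples, the following Python sketch applies rejection sampling and importance sampling to a toy corpus. It is not the article's implementation: the four-document corpus, the function names (`biased_sample`, `rejection_sample`, `importance_estimate`), and the assumption that the trial probabilities are known exactly are all introduced here for illustration. In the setting described above the weights are only approximate, and the biased sampler would be driven by queries to a search engine rather than by a local distribution.

```python
import random
from collections import Counter


def biased_sample(docs, trial_probs, rng):
    """Draw one document from a biased trial distribution.

    Hypothetical stand-in for a query-based sampler that favors
    some documents (e.g. long ones) over others.
    """
    return rng.choices(docs, weights=trial_probs, k=1)[0]


def rejection_sample(docs, trial_probs, n, rng):
    """Simulate n uniform samples from biased ones via rejection sampling."""
    # For a uniform target, the weight of x is proportional to 1 / p(x).
    weights = {d: 1.0 / p for d, p in zip(docs, trial_probs)}
    envelope = max(weights.values())  # upper bound on all weights
    accepted = []
    while len(accepted) < n:
        x = biased_sample(docs, trial_probs, rng)
        # Accept x with probability weight(x) / envelope.
        if rng.random() < weights[x] / envelope:
            accepted.append(x)
    return accepted


def importance_estimate(docs, trial_probs, f, n, rng):
    """Estimate the uniform average of f using self-normalized importance sampling."""
    weights = {d: 1.0 / p for d, p in zip(docs, trial_probs)}
    samples = [biased_sample(docs, trial_probs, rng) for _ in range(n)]
    numerator = sum(f(x) * weights[x] for x in samples)
    denominator = sum(weights[x] for x in samples)
    return numerator / denominator


if __name__ == "__main__":
    rng = random.Random(0)
    docs = ["a", "b", "c", "d"]
    # Toy bias: the sampling probability grows with document "length".
    trial_probs = [0.1, 0.2, 0.3, 0.4]
    print(Counter(rejection_sample(docs, trial_probs, 20000, rng)))
    print(importance_estimate(docs, trial_probs, lambda d: d == "a", 20000, rng))
```

Rejection sampling discards samples to correct the bias, so it returns actual near-uniform documents at the cost of extra draws; importance sampling keeps every draw but only yields weighted estimates of aggregate quantities.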
