Efficient search engine measurements

We address the problem of measuring global quality metrics of search engines, like corpus size, index freshness, and density of duplicates in the corpus. The recently proposed estimators for such metrics [2, 6] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so-called "document degrees".

We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.

Building on an idea from [6], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising accuracy.
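To make the role of document degrees and Rao-Blackwellization concrete, the following is a minimal toy sketch and not the estimators analyzed in the paper. It assumes a hypothetical in-memory query pool (`POOL`) whose results are fully visible, and it assumes exact degrees, where the degree of a document is taken to be the number of pool queries that match it. The basic estimator samples a query and then one matching document, weighting by the inverse degree; the Rao-Blackwellized variant averages over all documents matching the sampled query instead of drawing one.

```python
import random
from collections import Counter

# Hypothetical toy "index": each pool query maps to the documents it matches.
POOL = {
    "a": ["d1", "d2", "d3"],
    "b": ["d2", "d4"],
    "c": ["d3", "d4", "d5", "d6"],
    "d": ["d1"],
}

# Assumption: exact degrees are known. Degree = number of pool queries matching a document.
DEGREE = Counter(doc for docs in POOL.values() for doc in docs)


def basic_estimate(query, rng):
    """One importance-weighted sample: pick one matching document at random."""
    docs = POOL[query]
    doc = rng.choice(docs)
    # Weight = 1 / (probability of reaching doc), up to the shared degree term.
    return len(POOL) * len(docs) / DEGREE[doc]


def rao_blackwell_estimate(query):
    """Conditional expectation of basic_estimate given the sampled query."""
    return len(POOL) * sum(1.0 / DEGREE[doc] for doc in POOL[query])


def estimate_size(n_samples, rng, rao_blackwellize):
    """Average n_samples single-query estimates of the number of covered documents."""
    queries = list(POOL)
    total = 0.0
    for _ in range(n_samples):
        query = rng.choice(queries)
        if rao_blackwellize:
            total += rao_blackwell_estimate(query)
        else:
            total += basic_estimate(query, rng)
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print("true size:", len(DEGREE))
    print("basic:", estimate_size(2000, rng, rao_blackwellize=False))
    print("Rao-Blackwellized:", estimate_size(2000, rng, rao_blackwellize=True))
```

Both variants have the same expectation in this toy setting (the number of documents covered by the pool). The Rao-Blackwellized one removes the randomness of the document draw, so its variance cannot be larger. With approximate rather than exact degrees, a weighting of this kind would generally become biased, which is the situation the abstract describes.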

[1] Jun S. Liu, et al. Monte Carlo strategies in scientific computing, 2001.

[2] Marc Najork, et al. Measuring Index Quality Using Random Walks on the Web, 1999, Comput. Networks.

[3] C. Lee Giles, et al. Accessibility of information on the Web, 2000, INTL.

[4] Steve Chien, et al. Approximating Aggregate Queries about Web Pages via Random Walks, 2000, VLDB.

[5] Antonio Gulli, et al. The indexable web is more than 11.5 billion pages, 2005, WWW '05.

[6] Andrei Z. Broder, et al. Estimating corpus size via queries, 2006, CIKM '06.

[7] Jun S. Liu, et al. Metropolized independent sampling with comparisons to rejection sampling and importance sampling, 1996, Stat. Comput.

[8] G. Casella, et al. Rao-Blackwellisation of sampling schemes, 1996.

[9] Ziv Bar-Yossef, et al. Estimating the ImpressionRank of web pages, 2009, WWW '09.

[10] Oded Goldreich, et al. A Sample of Samplers - A Computational Perspective on Sampling (survey), 1997, Electron. Colloquium Comput. Complex.

[11] Eric T. Bradlow, et al. The Little Engines That Could: Modeling the Performance of World Wide Web Search Engines, 2000.

[12] C. D. Kemp, et al. Kendall's Advanced Theory of Statistics, Vol. 1: Distribution Theory, 1995.

[13] Andrei Z. Broder, et al. A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines, 1998, Comput. Networks.

[14] David M. Pennock, et al. Methods for Sampling Pages Uniformly from the World Wide Web, 2001.

[15] K. McCurley, et al. Income Inequality in the Attention Economy, 2007.

[16] Andrei Z. Broder, et al. Sampling Search-Engine Results, 2005, WWW '05.

[17] Stephen E. Fienberg, et al. How Large Is the World Wide Web?, 2004, Web Dynamics.

[18] A. Chao, et al. Estimating population size via sample coverage for closed capture-recapture models, 1994, Biometrics.

[19] Ziv Bar-Yossef, et al. Mining search engine query logs via suggestion sampling, 2008, Proc. VLDB Endow.

[20] M. Kendall, et al. Kendall's advanced theory of statistics, 1995.

[21] W. K. Hastings, et al. Monte Carlo Sampling Methods Using Markov Chains and Their Applications, 1970.

[22] A. W. Kemp, et al. Kendall's Advanced Theory of Statistics, 1994.

[23] N. Metropolis, et al. Equation of State Calculations by Fast Computing Machines, 1953, Resonance.

[24] Víctor Pàmies, et al. Open Directory Project, 2003.

[25] Ziv Bar-Yossef, et al. Random sampling from a search engine's index, 2006, WWW '06.

[26] Stephen E. Fienberg, et al. How Large Is the World Wide Web, 2004.

[27] Giles, et al. Searching the world wide Web, 1998, Science.

[28] D. Siegmund. Sequential Analysis: Tests and Confidence Intervals, 1985.

[29] Monika Henzinger, et al. A Comparison of Techniques for Sampling Web Pages, 2009, STACS.

[30] Marc Najork, et al. On near-uniform URL sampling, 2000, Comput. Networks.

[31] Heikki Mannila, et al. A random walk approach to sampling hidden databases, 2007, SIGMOD '07.