Efficient search engine measurements

We address the problem of measuring global quality metrics of search engines, like corpus size, index freshness, and density of duplicates in the corpus. The recently proposed estimators for such metrics [2, 6] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so-called "document degrees".

We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.

Building on an idea from [6], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising accuracy.
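As a rough illustration of the importance sampling idea the abstract refers to, the sketch below estimates the size of a toy corpus from documents that are sampled non-uniformly. It is not the paper's estimator: the names `TOY_INDEX` and `estimate_corpus_size` are hypothetical, and the sampling probability of each document is assumed to be known exactly, whereas the paper is concerned with the case where such quantities can only be approximated.

```python
import random

# Hypothetical toy index: each query maps to the documents it returns.
# Every document must be returned by at least one query.
TOY_INDEX = {
    "a": [0, 1, 2, 3],
    "b": [3, 4],
    "c": [4, 5, 6, 7, 8, 9],
}


def sampling_probabilities(query_results):
    """Probability of drawing each document when a query is chosen
    uniformly at random and then one of its results is chosen uniformly."""
    num_queries = len(query_results)
    prob = {}
    for docs in query_results.values():
        for doc in docs:
            prob[doc] = prob.get(doc, 0.0) + 1.0 / (num_queries * len(docs))
    return prob


def estimate_corpus_size(query_results, num_samples, rng):
    """Importance sampling estimate of the number of distinct documents.

    Each sampled document x contributes the weight 1 / p(x). The expected
    weight is sum_x p(x) * (1 / p(x)), which equals the corpus size.
    """
    queries = list(query_results)
    prob = sampling_probabilities(query_results)
    total = 0.0
    for _ in range(num_samples):
        query = rng.choice(queries)
        doc = rng.choice(query_results[query])
        total += 1.0 / prob[doc]
    return total / num_samples


if __name__ == "__main__":
    estimate = estimate_corpus_size(TOY_INDEX, 20000, random.Random(0))
    print(f"estimated corpus size: {estimate:.2f} (true size: 10)")
```

Documents returned by several queries are drawn more often, so they receive smaller weights; that reweighting is what corrects for the non-uniform sampling. If the probabilities were only approximate, the weights would be off and the estimate could become biased, which is the kind of problem the abstract describes.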

Ziv Bar-Yossef | Maxim Gurevich

[1] Jun S. Liu, et al. Monte Carlo strategies in scientific computing, 2001.

[2] Marc Najork, et al. Measuring Index Quality Using Random Walks on the Web, 1999, Comput. Networks.

[3] C. Lee Giles, et al. Accessibility of information on the Web, 2000, INTL.

[4] Steve Chien, et al. Approximating Aggregate Queries about Web Pages via Random Walks, 2000, VLDB.

[5] Antonio Gulli, et al. The indexable web is more than 11.5 billion pages, 2005, WWW '05.

[6] Andrei Z. Broder, et al. Estimating corpus size via queries, 2006, CIKM '06.

[7] Jun S. Liu, et al. Metropolized independent sampling with comparisons to rejection sampling and importance sampling, 1996, Stat. Comput.

[8] G. Casella, et al. Rao-Blackwellisation of sampling schemes, 1996.

[9] Ziv Bar-Yossef, et al. Estimating the impressionrank of web pages, 2009, WWW '09.

[10] Oded Goldreich, et al. A Sample of Samplers - A Computational Perspective on Sampling (survey), 1997, Electron. Colloquium Comput. Complex.

[11] Eric T. Bradlow, et al. The Little Engines That Could: Modeling the Performance of World Wide Web Search Engines, 2000.

[12] C. D. Kemp, et al. Kendall's Advanced Theory of Statistics, Vol. 1: Distribution Theory, 1995.

[13] Andrei Z. Broder, et al. A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines, 1998, Comput. Networks.

[14] David M. Pennock, et al. Methods for Sampling Pages Uniformly from the World Wide Web, 2001.

[15] K. McCurley, et al. Income Inequality in the Attention Economy, 2007.

[16] Andrei Z. Broder, et al. Sampling Search-Engine Results, 2005, WWW '05.

[17] Stephen E. Fienberg, et al. How Large Is the World Wide Web?, 2004, Web Dynamics.

[18] A. Chao, et al. Estimating population size via sample coverage for closed capture-recapture models, 1994, Biometrics.

[19] Ziv Bar-Yossef, et al. Mining search engine query logs via suggestion sampling, 2008, Proc. VLDB Endow.

[20] M. Kendall, et al. Kendall's advanced theory of statistics, 1995.

[21] W. K. Hastings, et al. Monte Carlo Sampling Methods Using Markov Chains and Their Applications, 1970.

[22] A. W. Kemp, et al. Kendall's Advanced Theory of Statistics, 1994.

[23] N. Metropolis, et al. Equation of State Calculations by Fast Computing Machines, 1953, Resonance.

[24] Víctor Pàmies, et al. Open Directory Project, 2003.

[25] Ziv Bar-Yossef, et al. Random sampling from a search engine's index, 2006, WWW '06.

[26] Stephen E. Fienberg, et al. How Large Is the World Wide Web, 2004.

[27] Giles, et al. Searching the world wide Web, 1998, Science.

[28] D. Siegmund. Sequential Analysis: Tests and Confidence Intervals, 1985.

[29] Monika Henzinger, et al. A Comparison of Techniques for Sampling Web Pages, 2009, STACS.

[30] Marc Najork, et al. On near-uniform URL sampling, 2000, Comput. Networks.

[31] Heikki Mannila, et al. A random walk approach to sampling hidden databases, 2007, SIGMOD '07.