Approximating Aggregate Queries about Web Pages via Random Walks

We present a random walk as an efficient and accurate approach to approximating certain aggregate queries about web pages. Our method uses a novel random walk to produce an almost uniformly distributed sample of web pages. The walk traverses a dynamically built regular undirected graph. Queries we have estimated using this method include the coverage of search engines, the proportion of pages belonging to .com and other domains, and the average size of web pages. Strong experimental evidence suggests that our walk produces accurate results quickly using very limited resources.
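To illustrate why a walk on a regular undirected graph yields near-uniform samples, here is a minimal sketch on a toy graph. It is not the paper's implementation. It assumes one standard way of making a graph regular, namely padding every node with self-loops up to a fixed degree bound, so that the transition matrix is symmetric and its stationary distribution is uniform. The graph, the `is_com` labels, and the names `regular_walk_sample` and `estimate_fraction` are all hypothetical.

```python
import random

# Toy undirected graph (hypothetical): node 0 is a hub linked to five leaf pages.
# A plain random walk would visit the hub far more often than any leaf.
NEIGHBORS = {
    0: [1, 2, 3, 4, 5],
    1: [0], 2: [0], 3: [0], 4: [0], 5: [0],
}
# Hypothetical page attribute: which pages belong to the .com domain (3 of 6).
IS_COM = {0: True, 1: True, 2: True, 3: False, 4: False, 5: False}

def regular_walk_sample(neighbors, start, max_degree, steps, rng):
    """Return the node reached after `steps` steps of a walk on the graph
    padded with self-loops so that every node has degree `max_degree`."""
    node = start
    for _ in range(steps):
        nbrs = neighbors[node]
        # Choose one of max_degree slots uniformly at random. Slots beyond
        # the real neighbors act as self-loops, so the walk stays put.
        slot = rng.randrange(max_degree)
        if slot < len(nbrs):
            node = nbrs[slot]
    return node

def estimate_fraction(neighbors, predicate, start=0, max_degree=6,
                      steps=60, samples=2000, seed=0):
    """Estimate the fraction of nodes satisfying `predicate` by averaging
    over the end points of independent walks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        node = regular_walk_sample(neighbors, start, max_degree, steps, rng)
        hits += 1 if predicate(node) else 0
    return hits / samples

if __name__ == "__main__":
    est = estimate_fraction(NEIGHBORS, lambda n: IS_COM[n])
    print(f"estimated .com fraction: {est:.3f} (true value 0.5)")
```

The degree bound `max_degree` must be at least the largest real degree in the graph. On the real web the link structure is discovered as the walk proceeds, so the step count needed for good mixing is not known in advance; the value of 60 used here is chosen only for this toy graph.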
