Methods for Sampling Pages Uniformly from the World Wide Web

We present two new algorithms for generating uniformly random samples of pages from the World Wide Web, building upon recent work by Henzinger et al. (Henzinger et al. 2000) and Bar-Yossef et al. (Bar-Yossef et al. 2000). Both algorithms are based on a weighted random-walk methodology. The first algorithm (DIRECTED-SAMPLE) operates on arbitrary directed graphs, and so is naturally applicable to the web. We show that, in the limit, this algorithm generates samples that are uniformly random. The second algorithm (UNDIRECTED-SAMPLE) operates on undirected graphs, thus requiring a mechanism for obtaining inbound links to web pages (e.g., access to a search engine). With this additional knowledge of inbound links, the algorithm can arrive at a uniform distribution faster than DIRECTEDSAMPLE, and we derive explicit bounds on the time to convergence. In addition, we evaluate the two algorithms on simulated web data, showing that both yield reliably uniform samples of pages. We also compare our results with those of previous algorithms, and discuss the theoretical relationships among the various proposed methods.