RankMass Crawler: A Crawler with High PageRank Coverage Guarantee

Crawling algorithms have been the subject of extensive research and optimization, but some important questions remain open. In particular, given the practically infinite number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web they will download after crawling a certain number of pages and (2) give high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.
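To make the idea behind such a crawl-and-stop guarantee concrete, the following is a minimal, hypothetical sketch (not the paper's actual RankMass algorithm): a greedy crawler that always downloads the frontier page with the highest importance score and stops once the accumulated score reaches a target coverage threshold. The `importance` and `out_links` mappings are assumed oracles standing in for the per-page importance bounds a real crawler would have to estimate online.

```python
import heapq

def greedy_crawl(seed_urls, importance, out_links, coverage_target):
    """Illustrative greedy crawl: repeatedly download the undownloaded
    page with the highest (assumed known) importance score, accumulating
    the covered importance mass, until coverage_target is reached."""
    # Max-heap via negated scores; the frontier holds discovered,
    # not-yet-downloaded pages.
    frontier = [(-importance[u], u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    downloaded, covered = [], 0.0
    while frontier and covered < coverage_target:
        neg_score, url = heapq.heappop(frontier)
        downloaded.append(url)          # "download" the page
        covered += -neg_score           # credit its importance mass
        for link in out_links.get(url, ()):  # discover new frontier pages
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-importance[link], link))
    return downloaded, covered

# Tiny example: page "a" (score 0.5) links to "b" (0.3) and "c" (0.2).
# With a coverage target of 0.8, the crawler stops after "a" and "b".
pages, cov = greedy_crawl(
    ["a"],
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": ["b", "c"]},
    0.8,
)
```

The point of the sketch is the stopping condition: because the crawler knows how much importance mass it has already covered, it can terminate with a quantitative statement about what it has not missed, which is the kind of guarantee the abstract describes.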
