Discovering URLs through user feedback

Search engines rely on crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the links found on the Web pages it fetches. Because the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss some pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. The mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the Yahoo! toolbar logs to characterize the URLs that are accessed by users via their browsers but not discovered by the Yahoo! Web crawler. We show that a high fraction of the URLs that appear in the toolbar logs are not discovered by the crawler. We also show that a certain fraction of URLs are discovered by the crawler only after the time they are first accessed by users. One important conclusion of our work is that Web search engines can benefit substantially from user feedback in the form of toolbar logs for passive URL discovery.
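
The two quantities described above (the fraction of toolbar URLs never discovered by the crawler, and the fraction discovered only after the first user access) can be illustrated with a minimal sketch. This is not the authors' method: the function name `discovery_stats` and the input format (two dictionaries mapping each URL to the time it was first seen) are assumptions made purely for illustration.

```python
from datetime import datetime


def discovery_stats(toolbar_first_seen, crawler_first_seen):
    """Compare first user access times with first crawler discovery times.

    Both arguments are hypothetical inputs: dicts mapping a URL string to
    the datetime at which it was first seen in the respective log.
    Returns the fraction of toolbar URLs the crawler never discovered
    ("missed") and the fraction it discovered only after the first user
    access ("late").
    """
    total = len(toolbar_first_seen)
    if total == 0:
        return {"missed": 0.0, "late": 0.0}

    missed = 0
    late = 0
    for url, accessed in toolbar_first_seen.items():
        discovered = crawler_first_seen.get(url)
        if discovered is None:
            missed += 1      # URL appears in toolbar logs only
        elif discovered > accessed:
            late += 1        # crawler found it after users did
    return {"missed": missed / total, "late": late / total}


if __name__ == "__main__":
    # Made-up example data, not taken from any real log.
    toolbar = {
        "http://example.com/a": datetime(2010, 1, 5),
        "http://example.com/b": datetime(2010, 1, 5),
        "http://example.com/c": datetime(2010, 1, 6),
        "http://example.com/d": datetime(2010, 1, 7),
    }
    crawler = {
        "http://example.com/a": datetime(2010, 1, 1),
        "http://example.com/b": datetime(2010, 1, 9),
    }
    print(discovery_stats(toolbar, crawler))
```

In a passive discovery setup of this kind, the URLs counted as missed or late are the ones a toolbar feed could hand to the crawler's frontier without any extra fetching.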
