The Viúva Negra crawler: an experience report

This paper documents hazardous situations on the Web that crawlers must address. This knowledge was accumulated over four years of developing and operating the Viúva Negra (VN) crawler, which feeds a search engine and a Web archive of the Portuguese Web. The design, implementation and evaluation of the VN crawler are also presented as a case study of Web crawler design. The case study provides tested crawling techniques that may be useful for the further development of crawlers. Copyright © 2007 John Wiley & Sons, Ltd.
