On the feasibility of geographically distributed web crawling

We identify the issues that are important in the design of a geographically distributed Web crawler. The identified issues are discussed from both a "benefit" and a "challenge" point of view. More specifically, we focus on the effect of the geographical locality of Web sites on crawling performance and, as a practical study, investigate the feasibility of a distributed crawler in terms of network costs. For this purpose, we conduct various experiments to collect network access statistics about the servers in the educational domains of eight different countries (the USA, Canada, Chile, Brazil, Spain, Portugal, Turkey, and Greece). We gather the statistics from four different sites located in the USA, Brazil, Spain, and Turkey using echoping. The results favor geographically distributed Web crawling in terms of crawling throughput.
