
BUbiNG: massive crawling for the masses

Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine tools and at the same time scales linearly with the amount of resources available. This paper aims at filling this gap.

Sebastiano Vigna | Andrea Marino | Paolo Boldi | Massimo Santini

[1] Jenny Edwards, et al. An adaptive model for optimizing performance of an incremental web crawler, 2001, WWW '01.

[2] Sergey Brin, et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998, Comput. Networks.

[3] Gregor von Bochmann, et al. A brief history of web crawlers, 2013, CASCON.

[4] Jens Stoye, et al. Simple and flexible detection of contiguous repeats using a suffix tree, 2002, Theor. Comput. Sci.

[5] Idit Keidar, et al. Do not crawl in the DUST: different URLs with similar text, 2006, WWW.

[6] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[7] Sebastiano Vigna, et al. UbiCrawler: a scalable fully distributed Web crawler, 2004, Softw. Pract. Exp.

[8] David Eichmann, et al. The RBSE spider - Balancing effective search against Web load, 1994, WWW Spring 1994.

[9] Denis Shestakov. Current Challenges in Web Crawling, 2013, ICWE.

[10] Maged M. Michael, et al. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms, 1996, PODC '96.

[11] Marc Najork, et al. High-performance Web Crawling, 2001.

[12] Marc Najork, et al. A large-scale study of the evolution of Web pages, 2003, WWW '03.

[13] Hiroki Arimura, et al. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, 2001, CPM.

[14] Adam Rifkin, et al. Nutch: A Flexible and Scalable Open-Source Web Search Engine, 2005.

[15] Oliver A. McBryan, et al. GENVL and WWWW: Tools for taming the Web, 1994, WWW Spring 1994.

[16] Burton H. Bloom, et al. Space/time trade-offs in hash coding with allowable errors, 1970, CACM.

[17] Torsten Suel, et al. Design and implementation of a high-performance distributed Web crawler, 2002, Proceedings 18th International Conference on Data Engineering.

[18] Marc Najork, et al. Mercator: A scalable, extensible Web crawler, 1999, World Wide Web.

[19] Dmitri Loguinov, et al. IRLbot: Scaling to 6 billion pages and beyond, 2009, TWEB.

[20] Marc Najork, et al. Web Crawling, 2010, Found. Trends Inf. Retr.

[21] Soumen Chakrabarti, et al. Mining the web - discovering knowledge from hypertext data, 2002.

[22] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.

[23] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.

[24] B. Pinkerton, et al. Finding What People Want: Experiences with the WebCrawler, 1994, WWW Spring 1994.