RCrawler: An R package for parallel web crawling and scraping

RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, and store pages, extract their contents, and produce data that can be used directly in web content mining applications. It is also flexible and can be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it provides URL and content-type filtering, depth-level control, and a robots.txt parser. The crawler is highly optimized: it can download a large number of pages per second and is robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler and report on our experience of implementing it in an R environment, including the optimizations used to work around the limitations of R. Finally, we discuss our experimental results.
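To make the listed features concrete, the following is a minimal Python sketch of a domain-restricted, depth-limited crawl with duplicate content detection. It is not RCrawler's implementation or API: the `PAGES` dictionary, the `crawl` function, and its parameters are hypothetical, the crawl runs sequentially over an in-memory site rather than in parallel over HTTP, and exact SHA-256 hashing stands in for whatever duplicate-detection method the package actually uses.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
# A real crawler would fetch and parse these pages over HTTP.
PAGES = {
    "http://example.com/": ("home", ["/a", "/b", "http://other.org/x"]),
    "http://example.com/a": ("article", ["/c", "/#top"]),
    "http://example.com/b": ("article", ["/c"]),  # same content as /a
    "http://example.com/c": ("deep page", ["/d"]),
    "http://example.com/d": ("too deep", []),
}


def crawl(seed, max_depth, fetch=PAGES.get):
    """Breadth-first crawl limited to the seed's domain and to max_depth."""
    domain = urlparse(seed).netloc
    seen_urls = {seed}      # URL filter: never enqueue the same URL twice
    seen_hashes = set()     # content fingerprints of stored pages
    stored, duplicates = [], []
    queue = deque([(seed, 0)])

    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:    # unknown URL: nothing to store or expand
            continue
        text, links = page

        # Duplicate content detection by exact fingerprint (an assumption).
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            duplicates.append(url)
        else:
            seen_hashes.add(digest)
            stored.append(url)

        # Depth control: do not expand links beyond max_depth.
        if depth >= max_depth:
            continue

        for link in links:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, link))
            # Domain filter: stay on the seed's host.
            if urlparse(absolute).netloc != domain:
                continue
            if absolute not in seen_urls:
                seen_urls.add(absolute)
                queue.append((absolute, depth + 1))

    return stored, duplicates


if __name__ == "__main__":
    stored, duplicates = crawl("http://example.com/", max_depth=2)
    print("stored:", stored)
    print("duplicates:", duplicates)
```

In this sketch the off-domain link is never followed, `/b` is reported as a duplicate of `/a`, and `/d` is never visited because it lies beyond the depth limit. A parallel version would distribute the fetch step across workers while keeping the URL and fingerprint sets shared.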
