Articulating the construction of a web scraper for massive data extraction

Massive volumes of data are generated by users, entities, and applications and disseminated online. This copious volume of big data is distributed across millions of websites and is available for a wide range of applications. Search engines provide a simple mechanism to access this data, but retrieving it that way requires a user to spend time and resources manually clicking and downloading. Such a manual approach clearly does not scale for the vast majority of real-life applications at the enterprise and organizational level. A number of automated approaches to web data extraction exist, but most are ad hoc and domain specific. A robust, automated, easy-to-use framework for extracting web content across domains with minimal human effort is therefore appealing. The web scraper architecture proposed by the authors addresses this gap, harvesting data from the web. The proposed web scraping framework offers an easy and feasible approach to parsing and extracting data at large scale from multiple websites with minimal human intervention. This paper provides insight into the issues relevant to constructing a web scraper and concludes by describing the implementation of a scraper that harvests learning objects for an eLearning application.
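To make the parsing-and-extraction step concrete, the sketch below shows one way a scraper's page-parsing component might be written using only the Python standard library. The class and function names (`LinkExtractor`, `extract_links`) and the example URLs are illustrative assumptions, not part of the framework described in the paper.

```python
# Hypothetical sketch: extracting hyperlinks from a fetched page so the
# scraper can follow them to further content. Uses only the Python stdlib.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute hyperlinks found in one page's HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Resolve each <a href="..."> against the page's base URL.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute links in `html`, relative to `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = (
        '<html><body>'
        '<a href="/lesson1.html">Lesson 1</a>'
        '<a href="http://other.example/l2">Lesson 2</a>'
        '</body></html>'
    )
    # "http://elearn.example/" is a placeholder domain for illustration.
    print(extract_links(page, "http://elearn.example/"))
```

A full scraper would pair a parser like this with a fetcher, a URL frontier, and per-site extraction rules; this fragment shows only the parsing core.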