Web Data Extraction System

SYNONYMS
Web data extraction toolkit, web information extraction system, wrapper generator, wrapper generator toolkit, web macros, web scraper.

DEFINITION
A web data extraction system is a software system that automatically and repeatedly extracts data from web pages with changing content and delivers the extracted data to a database or some other application. The task of web data extraction performed by such a system is usually divided into five different functions: (1) web interaction, which comprises mainly the navigation to usually predetermined target web pages containing the desired information; (2) support for wrapper generation and execution, where a wrapper is a program that identifies the desired data on target pages, extracts the data and transforms it into a structured format; (3) scheduling, which allows repeated application of previously generated wrappers to their respective target pages; (4) data transformation, which includes filtering, transforming, refining, and integrating data extracted from one or more sources and structuring the result according to a desired output format (usually XML or relational tables); and (5) delivering the resulting structured data to external applications such as database management systems, data warehouses, business software systems, content management systems, decision support systems, RSS publishers, email servers, or SMS servers. Alternatively, the output can be used to generate new web services out of existing and continually changing web sources.
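
The five functions named in the definition can be illustrated with a minimal Python sketch. It is not taken from any particular extraction system: the sample page, the table layout, the URL, and all names (`fetch_page`, `PriceWrapper`, `transform`, `deliver`, `run_scheduled`) are hypothetical and chosen only for illustration. To stay self-contained, the sketch replaces real web navigation with a fixed HTML string and replaces a timed scheduler with a simple loop.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical target page; a real system would retrieve it over HTTP.
SAMPLE_PAGE = """
<html><body><table>
<tr class="item"><td class="name">Widget</td><td class="price">$9.50</td></tr>
<tr class="item"><td class="name">Gadget</td><td class="price">$12.00</td></tr>
</table></body></html>
"""


def fetch_page(url):
    """(1) Web interaction: stand-in for navigating to the target page."""
    return SAMPLE_PAGE  # assumption: the content is fixed here; real pages change


class PriceWrapper(HTMLParser):
    """(2) Wrapper: locates name/price cells and emits one record per row."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None   # field currently being read, if any
        self._current = {}   # record under construction

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current:
            self.records.append(self._current)
            self._current = {}


def transform(records):
    """(4) Data transformation: drop incomplete records, convert prices."""
    return [
        (r["name"].strip(), float(r["price"].strip().lstrip("$")))
        for r in records
        if "name" in r and "price" in r
    ]


def deliver(rows, conn, run_id):
    """(5) Delivery: write the structured rows to a relational table."""
    conn.executemany(
        "INSERT INTO prices (run, name, price) VALUES (?, ?, ?)",
        [(run_id, name, price) for name, price in rows],
    )
    conn.commit()


def run_once(url, conn, run_id):
    wrapper = PriceWrapper()
    wrapper.feed(fetch_page(url))
    deliver(transform(wrapper.records), conn, run_id)


def run_scheduled(url, conn, runs):
    """(3) Scheduling: re-apply the same wrapper on each run.

    A real scheduler would wait between runs; this loop does not.
    """
    for run_id in range(runs):
        run_once(url, conn, run_id)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (run INTEGER, name TEXT, price REAL)")
run_scheduled("http://example.com/prices", conn, runs=2)
```

In this sketch the wrapper is written by hand. The systems described in this entry instead support generating such wrappers, so `PriceWrapper` should be read as the kind of artifact a wrapper generator might produce, not as a description of how one works.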
