Semantic Wrappers for Semi-Structured Data Extraction

In this paper, we propose an approach for extracting information from HTML pages and adding semantic (XML) tags to them. Wrapping is an essential technique for automatically extracting information from Web sources. This paper describes both a general rule-based approach that can be used to generate wrappers automatically and a wrapper generation assistant called WebMantic. We also provide experimental results showing that both the rule generation process and the preprocessing task are fast and reliable.
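
To make the idea of a rule-based wrapper concrete, the following is a minimal sketch in Python. It is not WebMantic's rule format or implementation, which the abstract does not describe. The `RULES` table, the `RuleWrapper` class, the `wrap` function, and the `record` root element are all hypothetical names chosen for illustration. The sketch assumes that each rule maps an HTML tag and `class` attribute to a semantic XML tag name.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# Hypothetical rules (an assumption for illustration):
# (HTML tag, class attribute) -> semantic XML tag name.
RULES = {
    ("h1", "title"): "title",
    ("span", "price"): "price",
}


class RuleWrapper(HTMLParser):
    """Collects the text of elements that match a rule into an XML record."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.current = None  # (html_tag, xml_tag) while inside a matched element
        self.buffer = []     # text fragments seen inside the matched element
        self.record = ET.Element("record")

    def handle_starttag(self, tag, attrs):
        # Only start a new match when not already inside one.
        if self.current is None:
            key = (tag, dict(attrs).get("class"))
            if key in self.rules:
                self.current = (tag, self.rules[key])
                self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first matching end tag closes the element,
        # so a matched element that nests the same tag is not handled.
        if self.current is not None and tag == self.current[0]:
            child = ET.SubElement(self.record, self.current[1])
            child.text = "".join(self.buffer).strip()
            self.current = None


def wrap(html, rules=RULES):
    """Apply the rules to an HTML string and return the tagged data as XML."""
    parser = RuleWrapper(rules)
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.record, encoding="unicode")


page = (
    '<html><body><h1 class="title">Blue Widget</h1>'
    '<p>ignored</p><span class="price"> 9.99 </span></body></html>'
)
print(wrap(page))
```

In this sketch the rules are written by hand. An approach like the one proposed would generate such rules automatically rather than require them to be authored per page.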
