Information Extraction from Unstructured and Ungrammatical Data Sources for Semantic Annotation

The internet has become an attractive avenue for global e-business, e-learning, and knowledge sharing. Because the volume of web content grows continuously, it is no longer practical for a user to extract information by browsing and manually integrating data from the large number of web sources returned by existing search engines. Semantic Web technology advances information extraction by providing a suite of tools for integrating data from different sources. To take full advantage of the Semantic Web, existing web pages must first be annotated so that they become semantic web pages. This research develops a tool named OWIE (Ontology-based Web Information Extraction) for semantic web annotation using domain-specific ontologies. The tool automatically extracts information from HTML pages with the help of pre-defined ontologies and gives the extracted information a semantic representation. Two case studies have been conducted to analyze the accuracy of OWIE.

Keywords: Ontology, Semantic Annotation, Wrapper, Information Extraction.
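The idea of extracting information from HTML pages with a pre-defined ontology and giving it a semantic representation can be pictured with a small sketch. This is not OWIE's actual algorithm, which the abstract does not describe; it is a minimal illustration under stated assumptions. The toy ontology, the `ex:` names, the regular expressions, and the `annotate` function are all hypothetical, and the "semantic representation" is reduced to plain subject-predicate-object tuples.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


# Hypothetical toy "ontology": one concept whose properties each carry a
# lexical pattern describing how a value may appear in free text.
ONTOLOGY = {
    "concept": "ex:Laptop",
    "properties": {
        "ex:brand": re.compile(r"\b(Dell|Lenovo|HP)\b", re.IGNORECASE),
        "ex:ramGB": re.compile(r"\b(\d+)\s*GB\s+RAM\b", re.IGNORECASE),
        "ex:price": re.compile(r"\$\s*(\d+(?:\.\d{2})?)"),
    },
}


def annotate(html, subject, ontology=ONTOLOGY):
    """Return (subject, predicate, object) triples found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)

    # Every page is typed with the ontology concept (an assumption of this sketch).
    triples = [(subject, "rdf:type", ontology["concept"])]
    for prop, pattern in ontology["properties"].items():
        match = pattern.search(text)
        if match:  # keep only the first occurrence of each property
            triples.append((subject, prop, match.group(1)))
    return triples


if __name__ == "__main__":
    page = "<html><body><h1>dell inspiron</h1><p>8gb ram, only $499.99!</p></body></html>"
    for triple in annotate(page, "ex:item1"):
        print(triple)
```

Because the patterns match on the text rather than on the markup, the sketch tolerates terse, ungrammatical listings such as the one above. A real system would likely need richer ontology definitions and a standard serialization such as RDF.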
