Ontology-Based Extraction of RDF Data from the World Wide Web

Tim Chartrand, Department of Computer Science, Master of Science

The simplicity and proliferation of the World Wide Web (WWW) have taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hindrance to the Semantic Web is the lack of existing semantically marked-up data. Until there is a critical mass of Semantic Web data, few people will develop and use Semantic Web applications. This project helps promote the Semantic Web by providing content. We apply existing information-extraction techniques, in particular the BYU ontology-based data-extraction system, to extract information from the WWW based on a Semantic Web ontology and to produce Semantic Web data with respect to that ontology. As an example of how the generated Semantic Web data can be used, we provide an application for browsing the extracted data and the source documents together. In this sense, the extracted data is superimposed over, or serves as an index over, the source documents. Our experiments with ontologies in four application domains show that our approach can indeed extract Semantic Web data from the WWW with precision and recall similar to those achieved by the underlying information-extraction system, and can make that data accessible to Semantic Web applications.
