Semiautomatic Generation of Data-Extraction Ontologies from Relational Databases

Data extraction is the process of gathering and structuring information found in documents (e.g., Web pages). One approach is ontology-based data extraction, in which an ontology guides the parser that extracts data from the source documents. In this context, an ontology is a conceptual schema enriched with the information needed to identify data items in the sources. Creating such an ontology is not a trivial task and may require the analysis of a large number of document instances. In many extraction applications, however, the information being extracted is already modeled in a relational database. In that case, the relational database schema can serve as a starting point for constructing a data-extraction ontology, and analysis of the data instances stored in the database can help generate the information used to parse data items in document sources. This paper presents a method for the semiautomatic creation of a data-extraction ontology, based on reverse engineering of the relational database schema combined with analysis of data instances.
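
To make the idea concrete, the following is a minimal sketch of how a schema and its stored instances might be turned into a first draft of recognition patterns. It is not the method of the paper: the function names (generalize, draft_ontology, demo), the character-class generalization rule, the example table car, and the use of SQLite are all assumptions chosen for illustration. The sketch reads the table and column names from the database catalog, then generalizes each stored value into a coarse regular expression.

```python
import re
import sqlite3

# Hypothetical tokenizer: splits a value into digit runs, letter runs, and
# single other characters. This rule is an assumption, not the paper's.
TOKEN = re.compile(
    r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL
)


def generalize(value):
    """Turn one data instance into a coarse regular expression."""
    parts = []
    for match in TOKEN.finditer(value):
        if match.lastgroup == "digits":
            # Keep the observed length of each digit run.
            parts.append("[0-9]{%d}" % len(match.group()))
        elif match.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            # Punctuation and other symbols are kept literally.
            parts.append(re.escape(match.group()))
    return "".join(parts)


def draft_ontology(conn):
    """Map each 'table.column' to an alternation of generalized patterns."""
    ontology = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for column in columns:
            name = column[1]  # second field of table_info is the column name
            rows = conn.execute(
                f'SELECT DISTINCT "{name}" FROM "{table}" '
                f'WHERE "{name}" IS NOT NULL'
            ).fetchall()
            patterns = sorted({generalize(str(value)) for (value,) in rows})
            ontology[f"{table}.{name}"] = "|".join(patterns)
    return ontology


def demo():
    """Build a small in-memory database (made-up data) and draft patterns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE car (model TEXT, year INTEGER, price TEXT)")
    conn.executemany(
        "INSERT INTO car VALUES (?, ?, ?)",
        [("Civic", 1998, "$4,500"), ("Corolla", 2001, "$12,900")],
    )
    return draft_ontology(conn)
```

Calling demo() returns a dictionary keyed by car.model, car.year, and car.price. A draft produced this way would still need manual review, which is consistent with the semiautomatic character of the approach: patterns generalized from a few instances can be too narrow (for example, a price with a different number of digits) or too broad.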
