Ontology-based Document Spanning Systems for Information Extraction

Information Extraction (IE) is the task of automatically organizing, in a structured form, data extracted from free-text documents. In several contexts, it is desirable that the extracted data be organized according to an ontology, which provides a formal and conceptual representation of the domain of interest. Ontologies allow for a better interpretation of the data, as well as for their semantic integration with other information, as in Ontology-based Data Access (OBDA), a popular declarative framework for data management in which an ontology is connected to a data layer through mappings. However, the data layer considered so far in OBDA has consisted essentially of relational databases, and how to declaratively couple an ontology with unstructured data sources is still unexplored. By leveraging the recent study on document spanners for rule-based IE by Fagin et al., in this paper we propose a new framework that allows text documents to be mapped to ontologies, in the spirit of OBDA. We investigate the problem of answering conjunctive queries in this framework. For ontologies specified in the Description Logics [Formula: see text] and [Formula: see text], we show that the problem is polynomial in the size of the underlying documents. We also provide algorithms that solve query answering by rewriting the input query on the basis of the ontology and of its mapping to the source documents. Through these techniques, we pursue a virtual approach, similar to that typically adopted in OBDA, which allows us to answer a query without first having to populate the entire ontology. Interestingly, for [Formula: see text], both the spanners used in the mapping and the one computed by the rewriting algorithm belong to the same class of expressiveness. This also holds for [Formula: see text], modulo some limitations on the form of the mapping. These results indicate that, in these cases, our framework can be easily implemented by decoupling ontology management from document access; the latter can be delegated to an external IE system able to process the extraction rules we use in the mapping.
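To make the virtual approach more concrete, the following is a minimal Python sketch, not the framework defined in the paper. It assumes a toy spanner given by a regular expression with named capture variables, a hypothetical mapping from its output to a role `worksFor`, and a single hypothetical ontology axiom stating that whoever works for something is an `Employee`. All names (`WORKS_FOR_RULE`, `spanner`, `mapping`, `answer_employee_query`) are illustrative assumptions.

```python
import re
from typing import Iterator, NamedTuple


class Span(NamedTuple):
    """Half-open character interval [start, end) of a document."""
    start: int
    end: int


# Hypothetical extraction rule: a regex with named capture variables,
# standing in for a very simple document spanner.
WORKS_FOR_RULE = re.compile(
    r"(?P<person>[A-Z][a-z]+) works for (?P<org>[A-Z][a-z]+)"
)


def spanner(doc: str) -> Iterator[dict[str, Span]]:
    """Yield one assignment of variables to spans per match of the rule."""
    for m in WORKS_FOR_RULE.finditer(doc):
        yield {var: Span(*m.span(var)) for var in ("person", "org")}


def mapping(doc: str) -> set[tuple[str, str, str]]:
    """Hypothetical mapping: turn spanner output into worksFor facts."""
    facts = set()
    for row in spanner(doc):
        person = doc[row["person"].start:row["person"].end]
        org = doc[row["org"].start:row["org"].end]
        facts.add(("worksFor", person, org))
    return facts


def answer_employee_query(doc: str) -> set[str]:
    """Answer q(x) <- Employee(x) via a rewriting over worksFor.

    Assumed axiom: the domain of worksFor is contained in Employee.
    The query is therefore rewritten as q(x) <- worksFor(x, y) and
    evaluated through the mapping; Employee is never populated.
    """
    return {person for (_, person, _) in mapping(doc)}


doc = "Alice works for Acme. Bob works for Initech."
print(sorted(answer_employee_query(doc)))
```

In this sketch the ontology reasoning (the rewriting step) is kept separate from document access (the regex evaluation), which mirrors the decoupling mentioned above: the extraction rule could in principle be handed to an external IE engine instead of Python's `re` module.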
