A Formal Framework for Coupling Document Spanners with Ontologies

A significant portion of the information that is nowadays collected in enterprises and organizations resides in text documents, and is thus inherently unstructured. Turning it into a structured form is the aim of information extraction (IE). Depending on the approach followed, the output of an IE process can fill forms, populate relational tables, or be presented through an ontology. This last approach is particularly interesting, since ontologies may facilitate integration with other corporate and external data, and enable data management and governance at an abstract, conceptual level, as in Ontology-based Data Access (OBDA). To this aim, OBDA uses declarative mappings that specify the relation between the ontology and the database to be accessed. In OBDA, however, only mappings towards relational databases have been considered so far, and how to declaratively relate the ontology to unstructured sources is still unexplored. By leveraging the study of document spanners for IE, in this paper we propose a new framework that allows text documents to be mapped to ontologies, in the spirit of the OBDA approach. We then investigate the problem of answering conjunctive queries (CQs) in our framework, and show that, if the ontology is specified in the lightweight Description Logic DL-Lite_R, the problem can be solved by reformulating the user query into a new spanner. Interestingly, the spanners used in the mapping and the one computed by the rewriting algorithm have the same expressiveness, and CQ answering in this case is polynomial in data complexity.
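To make the pipeline described above more concrete, the following is a minimal Python sketch, not the framework or the rewriting algorithm of the paper. It assumes a toy document, a regular expression with capture variables standing in for a spanner, and a toy ontology containing only atomic concept inclusions. All names (`DOC`, `PATTERN`, `TBOX`, `spanner`, `mapping`, `rewrite`, `answer`) are hypothetical and chosen for illustration only.

```python
import re

# Hypothetical document (an assumption for illustration, not from the paper).
DOC = "Alice Rossi is a professor at Sapienza. Bob Chen is a student at MIT."

# A regex with capture variables stands in for a simple spanner: each match
# assigns a span [start, end) of the document to each variable.
PATTERN = re.compile(r"(?P<x>[A-Z]\w+ [A-Z]\w+) is a (?P<role>professor|student)")


def spanner(doc):
    """Return one tuple of spans per match, as {variable: (start, end)}."""
    return [{v: m.span(v) for v in ("x", "role")} for m in PATTERN.finditer(doc)]


def mapping(doc):
    """Map spanner output to ontology assertions of the form (concept, individual)."""
    facts = set()
    for spans in spanner(doc):
        x_start, x_end = spans["x"]
        r_start, r_end = spans["role"]
        facts.add((doc[r_start:r_end].capitalize(), doc[x_start:x_end]))
    return facts


# Toy TBox with atomic concept inclusions only, read as (sub, super).
# Real DL-Lite_R axioms (roles, existentials, disjointness) are NOT handled here.
TBOX = {("Professor", "Person"), ("Student", "Person")}


def rewrite(concept):
    """Collect the concept and all concepts that the TBox places below it."""
    result, frontier = {concept}, [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in TBOX:
            if sup == current and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result


def answer(concept, doc):
    """Answer the atomic query concept(x) over the document through the mapping."""
    targets = rewrite(concept)
    return {individual for (c, individual) in mapping(doc) if c in targets}
```

In this sketch, the query for `Person` is expanded into `Person`, `Professor` and `Student` before the extracted facts are consulted, which mirrors the idea of reformulating the query instead of materializing inferred facts. Unlike the approach summarized in the abstract, the sketch filters extracted facts rather than producing a new spanner, and it covers only atomic queries.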
