Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity

Rule-based information extraction has recently received considerable attention from the database community, and several languages have been proposed in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, which makes them incapable of handling incomplete information. To remedy this, we propose extending information extraction languages with the ability to use mappings, which allows us to work with documents that have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information. We extend them with the ability to handle incomplete data and study how they compare in terms of expressive power. We also study the computational properties of these languages, focusing on the query enumeration problem as well as satisfiability and containment.
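To illustrate the difference between outputting relations and outputting mappings, the following minimal sketch uses Python's standard `re` module. It is only an informal analogy and not the formal semantics of regex formulas or extraction rules: the pattern, the variable names `name` and `phone`, and the function `extract_mappings` are hypothetical, spans are given as Python's 0-based half-open offsets, and `finditer` returns only non-overlapping leftmost matches. Each match yields a partial mapping from variables to spans, and a variable whose optional part is missing from the document is simply left unassigned.

```python
import re
from typing import Dict, List, Tuple

# A span is a (start, end) pair of 0-based offsets, end exclusive (Python convention).
Span = Tuple[int, int]

# Hypothetical pattern: "name" is mandatory, "phone" is optional.
PATTERN = re.compile(r"(?P<name>[A-Z][a-z]+)(?: tel:(?P<phone>\d+))?")


def extract_mappings(document: str) -> List[Dict[str, Span]]:
    """Return one partial mapping (variable -> span) per match.

    Variables whose capture group did not participate in the match are
    omitted from the mapping rather than padded with a placeholder value.
    """
    mappings = []
    for match in PATTERN.finditer(document):
        mapping = {
            var: match.span(var)
            for var in PATTERN.groupindex
            if match.group(var) is not None
        }
        mappings.append(mapping)
    return mappings


if __name__ == "__main__":
    for mapping in extract_mappings("Ann tel:555; Bob; Eve tel:42"):
        print(mapping)
```

In this sketch the record for "Bob" has no phone number, so its mapping assigns only `name`. A relational output with fixed columns would have to either drop that record or invent a value for the missing column.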
