Document Spanners

An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables. We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners—the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
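To make the notion of a spanner concrete, the following minimal Python sketch treats a regular expression as a primitive spanner that returns a relation over spans, then applies a Cartesian product and a string-equality selection as in the core spanners. The function names `extract` and `select_equal`, the 0-based half-open span convention, and the brute-force enumeration are assumptions of this illustration; they are not taken from the article or from SystemT.

```python
import re
from itertools import product


def extract(pattern, s):
    """Return every span (i, j) of s whose substring s[i:j] matches pattern in full.

    Spans are 0-based and half-open; this convention is an assumption of the sketch.
    """
    rx = re.compile(pattern)
    n = len(s)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if rx.fullmatch(s[i:j])
    }


def select_equal(s, relation):
    """String-equality selection: keep pairs of spans that cover equal substrings."""
    return {(x, y) for (x, y) in relation if s[x[0]:x[1]] == s[y[0]:y[1]]}


doc = "ab ab"
words = extract(r"[a-z]+", doc)        # unary relation over spans of doc
pairs = set(product(words, words))     # Cartesian product of the relation with itself
same_text = select_equal(doc, pairs)   # pairs of spans that spell the same string
```

For the input `"ab ab"`, `words` holds six spans, including `(0, 2)` and `(3, 5)`, which are different spans covering the same string `"ab"`; that pair survives the selection, whereas `((0, 2), (1, 2))` does not. Unlike a typical regex search, which reports non-overlapping matches, the sketch enumerates all matching spans, which is closer to the relational view described above. The quadratic enumeration only illustrates the semantics and says nothing about how such spanners are evaluated in practice.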
