Document Spanners: From Expressive Power to Decision Problems

We examine document spanners, a formal framework for information extraction introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners defined by regex formulas, which are essentially regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend regex formulas with standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to that of three models: patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and of various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.
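To make the notions of span, regex formula, and string equality selection concrete, here is a minimal brute-force sketch in Python. It is an illustration only, not the construction used in the paper: the helper names (`spans`, `substring`, `extract`, `cross`, `select_equal`) are hypothetical, spans are written as pairs `(i, j)` standing for [i, j> with 1-based positions, and `extract` is assumed to model only the simple regex formula Σ* x{α} Σ* by enumerating every span.

```python
import re
from itertools import product

def spans(doc):
    """All spans [i, j> of doc, with 1 <= i <= j <= len(doc) + 1."""
    n = len(doc)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]

def substring(doc, span):
    """The substring of doc that a span [i, j> refers to."""
    i, j = span
    return doc[i - 1:j - 1]

def extract(doc, var, pattern):
    """Relation for a formula of the shape  Sigma* var{pattern} Sigma*  (assumed shape).

    Every span whose substring fully matches `pattern` yields one tuple.
    """
    rx = re.compile(pattern)
    return [{var: s} for s in spans(doc) if rx.fullmatch(substring(doc, s))]

def cross(r1, r2):
    """Combine two relations over disjoint variables (a join without shared variables)."""
    return [{**t1, **t2} for t1, t2 in product(r1, r2)]

def select_equal(doc, rel, x, y):
    """String equality selection: keep tuples whose x and y spans spell the same string."""
    return [t for t in rel if substring(doc, t[x]) == substring(doc, t[y])]

doc = "abcab"
rel_x = extract(doc, "x", "[ab]+")
rel_y = extract(doc, "y", "[ab]+")
result = select_equal(doc, cross(rel_x, rel_y), "x", "y")
```

Note that the selection compares the strings the spans point to, not the spans themselves, so the tuple with x = (1, 3) and y = (4, 6) is kept even though the two spans are different intervals. The exhaustive enumeration is quadratic in the document length per variable and is meant only to show the semantics.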
