A Logic for Document Spanners

Document spanners are a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). Core spanners are one of the central models in this framework; they formalize the query language AQL that is used in IBM’s SystemT. As shown by Freydenberger and Holldack (ICDT 2016, ToCS 2018), there is a connection between core spanners and $\mathsf{EC}^{\text{reg}}$, the existential theory of concatenation with regular constraints. The present paper further develops this connection by defining $\mathsf{SpLog}$, a fragment of $\mathsf{EC}^{\text{reg}}$ that has the same expressive power as core spanners. The equivalence goes beyond expressive power: we show that polynomial-time conversions exist between $\mathsf{SpLog}$ and core spanners. Consequences and applications include an alternative way of defining relations for spanners, a pumping lemma for core spanners, and insights into the relative succinctness of various classes of spanner representations and their connection to graph querying languages. We also briefly discuss the connection between $\mathsf{SpLog}$ with negation and core spanners with a difference operator.
