Constant-Delay Enumeration for Nondeterministic Document Spanners

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is an algorithm that is tractable in combined complexity, i.e., in the sizes of both the input document and the VA, while ensuring the best possible data complexity bounds, i.e., constant delay in the size of the input document. Several recent works at PODS'18 proposed such algorithms, but either with linear delay in the document size or with an exponential dependency on the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.
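To make the problem statement concrete, the following is a minimal Python sketch of a naive baseline, not the algorithm of this paper. It explores all runs of a small nondeterministic VA by backtracking and filters duplicate mappings with a set, so neither its preprocessing nor its delay satisfies the bounds stated above. The transition encoding, the function name naive_mappings, the toy automaton, and the use of 0-based Python offsets for spans are all assumptions made for illustration.

```python
# Hypothetical encoding (not from the paper): a transition is a triple
# (source, label, target). A label is either a one-character string
# (read that letter) or a pair ("open", x) / ("close", x) for a variable x.

def naive_mappings(transitions, initial, finals, doc):
    """Yield each distinct mapping {variable: (start, end)} once.

    Naive baseline: backtracks over all runs, so the time between two
    outputs can grow with the document and with the number of runs.
    """
    seen = set()  # duplicate filter; needed because the VA is nondeterministic

    def walk(state, pos, opened, closed):
        # Accept when the whole document is read, in a final state,
        # with no variable left open.
        if pos == len(doc) and state in finals and not opened:
            result = tuple(sorted(closed.items()))
            if result not in seen:
                seen.add(result)
                yield result
        for src, label, dst in transitions:
            if src != state:
                continue
            if isinstance(label, str):
                # Letter transition: consume one character of the document.
                if pos < len(doc) and doc[pos] == label:
                    yield from walk(dst, pos + 1, opened, closed)
            else:
                kind, var = label
                if kind == "open" and var not in opened and var not in closed:
                    yield from walk(dst, pos, {**opened, var: pos}, closed)
                elif kind == "close" and var in opened:
                    rest = {k: v for k, v in opened.items() if k != var}
                    span = (opened[var], pos)
                    yield from walk(dst, pos, rest, {**closed, var: span})

    yield from walk(initial, 0, {}, {})


# Toy VA: x captures any (possibly empty) run of 'a' letters.
# States 1 and 3 are two redundant branches, so each mapping has two runs.
toy_transitions = [
    (0, "a", 0), (0, "b", 0),
    (0, ("open", "x"), 1), (1, "a", 1), (1, ("close", "x"), 2),
    (0, ("open", "x"), 3), (3, "a", 3), (3, ("close", "x"), 2),
    (2, "a", 2), (2, "b", 2),
]

results = sorted(naive_mappings(toy_transitions, 0, {2}, "ab"))
for mapping in results:
    print(dict(mapping))
```

On the document "ab", every mapping of this toy automaton is produced by two different runs, which is why the sketch needs the duplicate filter. Avoiding such repeated work while keeping the dependency on the automaton polynomial is the difficulty that the bounds in the abstract address.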
