The Complexity of Aggregates over Extractions by Regular Expressions

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (intervals identified by their start and end indices) from text. Building on these, Fagin et al. introduced regular document spanners, which are the closure of regex formulas under relational algebra. In this work, we study the computational complexity of querying text with aggregate functions, such as sum, average, or quantiles, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of computing the aggregates exactly and approximately. Specifically, we show that in a restricted case all aggregates can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).
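To make the setting concrete, the following is a minimal illustrative sketch, not the formalism or the algorithms of this work. It assumes Python's `re` module as a stand-in for a regex formula, with a named group playing the role of a capture variable. Note that `re.finditer` returns only non-overlapping matches, whereas a regex formula yields every valid assignment of spans, and that Python spans are 0-indexed and half-open. The names `extract_spans`, `aggregate`, `DOC`, and `PATTERN` are hypothetical.

```python
import re
import statistics


def extract_spans(pattern, document, variable):
    """Return the (start, end) span of the named group for each match.

    Simplification: only non-overlapping matches are reported.
    """
    return [m.span(variable) for m in re.finditer(pattern, document)]


def aggregate(document, spans):
    """Interpret each span's text as an integer and aggregate the values."""
    values = [int(document[start:end]) for start, end in spans]
    if not values:
        return {"count": 0, "sum": 0, "avg": None, "median": None}
    return {
        "count": len(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "median": statistics.median(values),
    }


# Hypothetical document and extractor: capture every number followed by " kg".
DOC = "apples: 3 kg, pears: 10 kg, plums: 5 kg"
PATTERN = r"(?P<x>\d+) kg"

SPANS = extract_spans(PATTERN, DOC, "x")
RESULT = aggregate(DOC, SPANS)
print(SPANS)
print(RESULT)
```

Here the extracted relation is small enough to materialize before aggregating. The complexity questions studied in this work arise because a regular document spanner can in general produce a very large number of tuples, so enumerating all of them first is not always feasible.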
