Tight lower bounds for query processing on streaming and external memory data

It is generally assumed that databases have to reside in external, inexpensive storage because of their sheer size. With current external storage technology, a small number of sequential scans of the data is strictly preferable, performance-wise, to random data accesses. Database technology, and query processing technology in particular, has developed around a notion of memory hierarchies with layers of greatly varying sizes and access times. The current technologies seem to scale up to their tasks and are very successful, but on closer investigation our theoretical understanding of the problems involved, and of optimal algorithms for these problems, appears to be less well developed. Recently, data stream processing has become an object of study by the database management community, but from the viewpoint of database theory, this is really a special case of the query processing problem on data in external storage where we are limited to a single scan of the input data. In the present paper we study a clean machine model for external memory and stream processing. We establish tight bounds for the data complexity of Core XPath evaluation and filtering. We show that the number of scans of the external data induces a strict hierarchy (as long as internal memory space is sufficiently small, e.g., polylogarithmic in the size of the input). We also show that neither joins nor sorting is feasible if the product of the number r(n) of scans of the external memory and the size s(n) of the internal memory buffers is sufficiently small, i.e., of size o(n).
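To give some intuition for the trade-off between the number of scans r(n) and the internal memory size s(n), the following is a minimal sketch of a simple multi-pass sorting procedure, not the construction used in the paper. It assumes a read-only input that can be rescanned sequentially, an internal buffer holding at most `buffer_size` items (measured in items rather than bits), and an append-only output that does not count as internal memory. The names `sort_with_scans`, `scan`, and `buffer_size` are hypothetical and chosen for illustration only. Each pass extracts the next `buffer_size` smallest items, so roughly n / `buffer_size` scans suffice, which is the regime where the product of scans and buffer size is about n.

```python
import heapq
from typing import Callable, Iterable, List, Optional, Tuple


def sort_with_scans(
    scan: Callable[[], Iterable[int]], buffer_size: int
) -> Tuple[List[int], int]:
    """Sort a rescannable input using a bounded buffer; return (output, scans).

    `scan()` must return a fresh sequential pass over the same input each time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    output: List[int] = []  # stands in for the append-only external output
    scans = 0
    # Largest (value, position) pair written so far; positions break ties
    # so that duplicate values are emitted exactly once each.
    last: Optional[Tuple[int, int]] = None

    while True:
        scans += 1
        # One sequential pass: consider only items not yet written out.
        candidates = (
            (value, pos)
            for pos, value in enumerate(scan())
            if last is None or (value, pos) > last
        )
        # heapq.nsmallest keeps at most `buffer_size` items in memory
        # and returns them in ascending order.
        batch = heapq.nsmallest(buffer_size, candidates)
        output.extend(value for value, _ in batch)
        if len(batch) < buffer_size:
            break  # fewer items than the buffer holds: input is exhausted
        last = batch[-1]

    return output, scans


if __name__ == "__main__":
    data = [5, 3, 8, 1, 9, 2, 8, 3, 7, 0]
    result, scans = sort_with_scans(lambda: iter(data), buffer_size=3)
    print(result, scans)
```

Doubling `buffer_size` roughly halves the number of scans in this sketch, which matches the shape of the r(n) · s(n) trade-off stated in the abstract; the paper's lower bound says that this product cannot be pushed down to o(n).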
