Beyond Equi-joins: Ranking, Enumeration and Factorization

We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time. This is within a polylogarithmic factor of O(n + k log k), i.e., the best known complexity for equi-joins, and even of O(n + k), i.e., the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called “free-connex” queries and those that use bag semantics). Remarkably, they hold even when the number of join results is n^ℓ for a join of ℓ relations. The key ingredient is a novel O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.
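To make the idea of ranked enumeration over an inequality join concrete, here is a minimal Python sketch. It is not the construction from the paper: it handles only a single binary join R.a < S.b ranked by the sum of tuple weights, and it swaps in a standard merge-sort tree plus a priority queue. The function names, the (key, weight) tuple layout, and the sum ranking are all assumptions made for illustration. The sketch does show the general flavor: a compact O(n log n)-size structure stands in for the possibly quadratic join output, and answers are produced lazily in ranked order.

```python
import heapq
from bisect import bisect_right


def build_merge_sort_tree(weights):
    """Segment tree in which each node stores (weight, position) pairs
    of its leaves, sorted by weight. Total size is O(n log n)."""
    size = 1
    while size < max(1, len(weights)):
        size *= 2
    tree = [[] for _ in range(2 * size)]
    for pos, w in enumerate(weights):
        tree[size + pos] = [(w, pos)]
    for node in range(size - 1, 0, -1):
        tree[node] = list(heapq.merge(tree[2 * node], tree[2 * node + 1]))
    return tree, size


def canonical_nodes(size, lo, hi):
    """Return O(log n) tree nodes that exactly cover positions [lo, hi)."""
    nodes = []
    lo += size
    hi += size
    while lo < hi:
        if lo & 1:
            nodes.append(lo)
            lo += 1
        if hi & 1:
            hi -= 1
            nodes.append(hi)
        lo //= 2
        hi //= 2
    return nodes


def ranked_inequality_join(R, S):
    """Lazily yield (total_weight, r, s) for all r in R, s in S with
    r.key < s.key, in ascending order of total weight.
    R and S are lists of (key, weight) tuples (an assumed layout)."""
    S_sorted = sorted(S)                      # sort S by join key
    keys = [b for b, _ in S_sorted]
    tree, size = build_merge_sort_tree([w for _, w in S_sorted])

    heap = []
    for i, (a, wr) in enumerate(R):
        start = bisect_right(keys, a)         # first position with key > a
        # The matching S tuples form a suffix, covered by O(log n) nodes.
        for node in canonical_nodes(size, start, len(S_sorted)):
            w, _ = tree[node][0]
            heap.append((wr + w, i, node, 0))
    heapq.heapify(heap)

    while heap:
        total, i, node, idx = heapq.heappop(heap)
        _, pos = tree[node][idx]
        yield total, R[i], S_sorted[pos]
        if idx + 1 < len(tree[node]):         # next-best partner in this node
            nxt = tree[node][idx + 1][0]
            heapq.heappush(heap, (R[i][1] + nxt, i, node, idx + 1))
```

Because the function is a generator, a caller can stop after the first k answers, for example with `itertools.islice(ranked_inequality_join(R, S), k)`, without ever materializing the full join result.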
