Skew strikes back: new developments in the theory of join algorithms

Evaluating the relational join is one of the central and most well-studied algorithmic problems in database systems. A staggering number of variants have been considered, including block-nested loop join, hash join, Grace, and sort-merge (see Graefe [17] for a survey, and [4, 7, 24] for discussions of more modern issues). Commercial database engines use finely tuned join heuristics that take into account a wide variety of factors, including the selectivity of various predicates, memory, IO, etc. Despite this extensive study of join queries, the textbook description of join processing is suboptimal. This survey describes recent results on join algorithms that have provable worst-case optimal runtime guarantees. We survey recent work and provide a simpler and unified description of these algorithms that we hope is useful for theory-minded readers, algorithm designers, and systems implementors. Much of this progress can be understood by thinking about a simple join evaluation problem that we illustrate with the so-called triangle query, a query that has become increasingly popular in the last decade with the advent of social networks, biological motifs, and graph databases [36, 37].
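
To make the triangle query concrete, the following is a minimal Python sketch, assuming the usual formulation Q = R(A,B) ⋈ S(B,C) ⋈ T(A,C) over sets of pairs. The function name `triangle_join` and the index names are hypothetical and not taken from the survey. The sketch binds one attribute at a time and intersects candidate sets by scanning the smaller one, which is the general flavor of the attribute-at-a-time algorithms the abstract refers to; it is an illustration under these assumptions, not the authors' algorithm.

```python
from collections import defaultdict


def triangle_join(R, S, T):
    """Return sorted (a, b, c) with (a, b) in R, (b, c) in S, (a, c) in T.

    Hypothetical illustration: binds a, then b, then c, instead of
    materializing a pairwise intermediate join such as R join S.
    """
    # Hash indexes: for each key value, the set of matching partner values.
    r_by_a = defaultdict(set)
    for a, b in R:
        r_by_a[a].add(b)
    s_by_b = defaultdict(set)
    for b, c in S:
        s_by_b[b].add(c)
    t_by_a = defaultdict(set)
    for a, c in T:
        t_by_a[a].add(c)

    out = []
    # Only values of a that occur in both R and T can extend to a triangle.
    for a in r_by_a.keys() & t_by_a.keys():
        for b in r_by_a[a]:
            if b not in s_by_b:
                continue
            cs_s, cs_t = s_by_b[b], t_by_a[a]
            # Scan the smaller candidate set and probe the larger one,
            # so a single heavily skewed value does not dominate the cost.
            small, large = (cs_s, cs_t) if len(cs_s) <= len(cs_t) else (cs_t, cs_s)
            for c in small:
                if c in large:
                    out.append((a, b, c))
    return sorted(out)


edges = {(1, 2), (2, 3), (1, 3)}
triangles = triangle_join(edges, edges, edges)
```

Here the same edge set plays the role of all three relations, as in triangle listing on a graph. A pairwise plan would instead first compute R ⋈ S in full, which can be much larger than the final output on skewed inputs.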

[1]  Johannes Gehrke,et al.  Database management systems (3. ed.) , 2003 .

[2]  Phokion G. Kolaitis,et al.  Conjunctive-query containment and constraint satisfaction , 1998, PODS.

[3]  Francesco Scarcello,et al.  Query answering exploiting structural properties , 2005, SGMD.

[4]  Ronald Fagin,et al.  Degrees of acyclicity for hypergraphs and relational database schemes , 1983, JACM.

[5]  B. Bollobás,et al.  Projections of Bodies and Hereditary Properties of Hypergraphs , 1995 .

[6]  Pradeep Dubey,et al.  Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..

[7]  Johannes Gehrke,et al.  Database Management Systems, 3/E , 2014 .

[8]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[9]  Liang Chen,et al.  Handling data skew in parallel joins in shared-nothing systems , 2008, SIGMOD Conference.

[10]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[11]  Serge Abiteboul,et al.  Foundations of Databases , 1994 .

[12]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[13]  Fan Chung Graham,et al.  Some intersection theorems for ordered sets and graphs , 1986, J. Comb. Theory, Ser. A.

[14]  Sergei Vassilvitskii,et al.  Counting triangles and the curse of the last reducer , 2011, WWW.

[15]  Jignesh M. Patel,et al.  Design and evaluation of main memory hash join algorithms for multi-core CPUs , 2011, SIGMOD '11.

[16]  Dániel Marx,et al.  Constraint solving via fractional edge covers , 2006, SODA '06.

[17]  Todd L. Veldhuizen,et al.  Leapfrog Triejoin: a worst-case optimal join algorithm , 2012, ArXiv.

[18]  Clement Yu,et al.  On determining tree query membership of a distributed query , 1980 .

[19]  Moshe Y. Vardi.  The complexity of relational query languages (Extended Abstract) , 1982, STOC '82.

[20]  Raghu Ramakrishnan,et al.  Database Management Systems , 1976 .

[21]  Paul D. Seymour,et al.  Graph Minors. II. Algorithmic Aspects of Tree-Width , 1986, J. Algorithms.

[22]  David J. DeWitt,et al.  Practical Skew Handling in Parallel Joins , 1992, VLDB.

[23]  Georg Gottlob,et al.  Size and treewidth bounds for conjunctive queries , 2009, JACM.

[24]  J. Kahn,et al.  On the number of copies of one hypergraph in another , 1998 .

[25]  Mihalis Yannakakis,et al.  Algorithms for Acyclic Database Schemes , 1981, VLDB.

[26]  H. Whitney,et al.  An inequality related to the isoperimetric inequality , 1949 .

[27]  Alfred G. Dale,et al.  A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins , 1991, VLDB.

[28]  J. Radhakrishnan.  Entropy and Counting , 2001 .

[29]  Mihalis Yannakakis,et al.  On the complexity of database queries (extended abstract) , 1997, PODS.

[30]  Dániel Marx,et al.  Size Bounds and Query Plans for Relational Joins , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[31]  Georg Gottlob,et al.  Hypertree decompositions and tractable queries , 1998, J. Comput. Syst. Sci..

[32]  Thomas Schwentick,et al.  Generalized hypertree decompositions: NP-hardness and tractable variants , 2007, PODS '07.

[33]  Georg Gottlob,et al.  Robbers, marshals, and guards: game theoretic and logical characterizations of hypertree width , 2001, PODS '01.

[34]  Dániel Marx,et al.  Approximating fractional hypertree width , 2009, TALG.

[35]  Alfred V. Aho,et al.  The theory of joins in relational data bases , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[36]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[37]  Catriel Beeri,et al.  A Proof Procedure for Data Dependencies , 1984, JACM.

[38]  Marc Gyssens,et al.  Decomposing Constraint Satisfaction Problems Using Database Techniques , 1994, Artif. Intell..

[39]  Mihalis Yannakakis,et al.  On the Complexity of Database Queries , 1999, J. Comput. Syst. Sci..

[40]  Anand Rajaraman,et al.  Conjunctive query containment revisited , 1997, Theor. Comput. Sci..

[41]  Marc Gyssens,et al.  A Decomposition Methodology for Cyclic Databases , 1982, Advances in Data Base Theory.

[42]  Charalampos E. Tsourakakis.  Fast Counting of Triangles in Large Real Networks without Counting: Algorithms and Laws , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[43]  David Maier,et al.  Testing implications of data dependencies , 1979, SIGMOD '79.

[44]  Martin Grohe.  Bounds and Algorithms for Joins via Fractional Edge Covers , 2013, In Search of Elegance in the Theory and Practice of Computation.

[45]  Ashok K. Chandra,et al.  Optimal implementation of conjunctive queries in relational data bases , 1977, STOC '77.