New algorithms for join and grouping operations

Traditional database query processing relies on three types of algorithms for join and for grouping operations. For joins, index nested loops join exploits an index on its inner input, merge join exploits sorted inputs, and hash join exploits differences in the sizes of the join inputs. For grouping, an index-based algorithm has been used in the past whereas today sort- and hash-based algorithms prevail. Cost-based query optimization chooses the most appropriate algorithm for each query and for each operation. Unfortunately, mistaken algorithm choices during compile-time query optimization are common yet expensive to investigate and to resolve.Our goal is to end mistaken choices among join algorithms and among grouping algorithms by replacing the three traditional types of algorithms with a single one. Like merge join, this new join algorithm exploits sorted inputs. Like hash join, it exploits different input sizes for unsorted inputs. In fact, for unsorted inputs, the cost functions for recursive hash join and for hybrid hash join have guided our search for the new join algorithm. In consequence, the new join algorithm can replace both merge join and hash join in a database management system.The in-memory components of the new join algorithm employ indexes. If the database contains indexes for one (or both) of the inputs, the new join can exploit persistent indexes instead of temporary in-memory indexes. Using database indexes to find matching input records, the new join algorithm can also replace index nested loops join.In addition to join operations, a very similar algorithm supports grouping (“group by” queries in SQL) and duplicate elimination. For unsorted inputs, candidate output records take on the role of one of the inputs in a join operation. Our goal is to define a single grouping algorithm that can replace grouping by repeated index searches, by sorting, and by hashing. In other words, our goal is to end mistaken algorithm choices not only for joins and other binary matching operations but also for grouping and other unary matching operations in database query processing.Finally, these new algorithms can be instrumental for efficient and robust data processing in a map-reduce environment, because ‘map’ and ‘reduce’ operations are similar in essentials to join and grouping operations.Results from an implementation of the core algorithm are reported.

[1]  Kjell Bratbergsengen,et al.  Hashing Methods and Relational Algebra Operations , 1984, VLDB.

[2]  Karen Ward,et al.  Dynamic query evaluation plans , 1989, SIGMOD '89.

[3]  Donald E. Knuth,et al.  The Art of Computer Programming: Volume 3: Sorting and Searching , 1998 .

[4]  David Maier,et al.  Efficient assembly for complex objects , 1991, SIGMOD '91.

[5]  Joseph M. Hellerstein,et al.  Optimization techniques for queries with expensive methods , 1998, TODS.

[6]  Kyuseok Shim,et al.  Including Group-By in Query Optimization , 1994, VLDB.

[7]  Mohamed Ziauddin,et al.  Query processing and optimization in Oracle Rdb , 1996, The VLDB Journal.

[8]  Hidehiko Tanaka,et al.  An Overview of The System Software of A Parallel Relational Database Machine GRACE , 1986, VLDB.

[9]  Goetz Graefe,et al.  Implementing sorting in database systems , 2006, CSUR.

[10]  Harumi A. Kuno,et al.  Visualizing the robustness of query execution , 2009, CIDR.

[11]  Donald D. Chamberlin,et al.  Access Path Selection in a Relational Database Management System , 1989 .

[12]  Won Kim,et al.  A new way to compute the product and join of relations , 1980, SIGMOD '80.

[13]  David J. DeWitt,et al.  Duplicate record elimination in large data files , 1983, TODS.

[14]  J. Eliot B. Moss,et al.  Working with Persistent Objects: To Swizzle or Not to Swizzle , 1992, IEEE Trans. Software Eng..

[15]  Eugene J. Shekita,et al.  Fundamental techniques for order optimization , 1996, SIGMOD '96.

[16]  Goetz Graefe Master-detail clustering using merged indexes , 2007, Informatik - Forschung und Entwicklung.

[17]  Guangqiang Li On the design and evaluation of a new order-based join algorithm , 2010 .

[18]  Rudolf Bayer,et al.  Prefix B-trees , 1977, TODS.

[19]  Robert B. Hagmann,et al.  An Observation on Database Buffering Performance Metrics , 1986, VLDB.

[20]  Quanzhong Li,et al.  Adaptively Reordering Joins during Query Execution , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[21]  Goetz Graefe,et al.  A survey of B-tree locking techniques , 2010, TODS.

[22]  C. Mohan,et al.  Single Table Access Using Multiple Indexes: Optimization, Execution, and Concurrency Control Techniques , 1990, EDBT.

[23]  Per-Åke Larson,et al.  Buffering and Read-Ahead Strategies for External Mergesort , 1998, VLDB.

[24]  David Maier,et al.  Efficient Assembly of Complex Objects ; CU-CS-502-90 , 1990 .

[25]  Leonard D. Shapiro,et al.  Join processing in database systems with large main memories , 1986, TODS.

[26]  Guy M. Lohman,et al.  Grammar-like functional rules for representing query optimization alternatives , 1988, SIGMOD '88.

[27]  David J. DeWitt,et al.  Nested loops revisited , 1993, [1993] Proceedings of the Second International Conference on Parallel and Distributed Information Systems.

[28]  Bernhard Seeger,et al.  Progressive Merge Join: A Generic and Non-blocking Sort-based Join Algorithm , 2002, VLDB.

[29]  Per-Åke Larson,et al.  External Sorting: Run Formation Revisited , 2003, IEEE Trans. Knowl. Data Eng..

[30]  Goetz Graefe B-tree indexes, interpolation search, and skew , 2006, DaMoN '06.

[31]  Michael J. Carey,et al.  A performance evaluation of pointer-based joins , 1990, SIGMOD '90.

[32]  Goetz Graefe,et al.  Sort versus Hash Revisited , 1994, IEEE Trans. Knowl. Data Eng..

[33]  Michael Stonebraker,et al.  Implementation techniques for main memory database systems , 1984, SIGMOD '84.

[34]  David J. DeWitt,et al.  Multiprocessor Hash-Based Join Algorithms , 1985, VLDB.

[35]  Hansjörg Zeller,et al.  An Adaptive Hash Join Algorithm for Multiuser Environments , 1990, VLDB.

[36]  Masaya Nakayama,et al.  The Effect of Bucket Size Tuning in the Dynamic Hybrid GRACE Hash Join Method , 1989, VLDB.

[37]  Miron Livny,et al.  Memory-Adaptive External Sorting , 1993, VLDB.

[38]  Jennifer Widom,et al.  Content-Based Routing: Different Plans for Different Data , 2005, VLDB.

[39]  Masaya Nakayama,et al.  Hash-Partitioned Join Method Using Dynamic Destaging Strategy , 1988, VLDB.

[40]  Harumi A. Kuno,et al.  Self-selecting, self-tuning, incrementally optimized indexes , 2010, EDBT '10.

[41]  C. Mohan,et al.  Algorithms for creating indexes for very large tables without quiescing updates , 1992, SIGMOD '92.

[42]  Alfons Kemper,et al.  Generalised Hash Teams for Join and Group-by , 1999, VLDB.

[43]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[44]  Laura M. Haas,et al.  Seeking the truth about ad hoc join costs , 1997, The VLDB Journal.

[45]  Yannis E. Ioannidis,et al.  Randomized algorithms for optimizing large join queries , 1990, SIGMOD '90.

[46]  Sven Helmer,et al.  Diag-Join: An Opportunistic Join Algorithm for 1:N Relationships , 1998, VLDB.

[47]  Michael J. Carey,et al.  On saying “Enough already!” in SQL , 1997, SIGMOD '97.

[48]  Volker Markl,et al.  LEO: An autonomic query optimizer for DB2 , 2003, IBM Syst. J..

[49]  David J. DeWitt,et al.  The EXODUS optimizer generator , 1987, SIGMOD '87.

[50]  Betty Salzberg,et al.  Merging sorted runs using large main memory , 1989, Acta Informatica.

[51]  David J. DeWitt,et al.  GAMMA - A High Performance Dataflow Database Machine , 1986, VLDB.

[52]  Goetz Graefe,et al.  Sorting And Indexing With Partitioned B-Trees , 2003, CIDR.

[53]  Goetz Graefe,et al.  Fast algorithms for universal quantification in large databases , 1995, TODS.

[54]  Joseph M. Hellerstein,et al.  Eddies: continuously adaptive query processing , 2000, SIGMOD '00.

[55]  Goetz Graefe,et al.  Hash Joins and Hash Teams in Microsoft SQL Server , 1998, VLDB.

[56]  Goetz Graefe,et al.  Memory management during run generation in external sorting , 1998, SIGMOD '98.

[57]  Goetz Graefe,et al.  Executing Nested Queries , 2003, BTW.

[58]  Guy M. Lohman,et al.  Query Optimization in the IBM DB2 Family. , 1993 .

[59]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[60]  Michael J. Franklin,et al.  Streaming Queries over Streaming Data , 2002, VLDB.

[61]  Eugene Wong,et al.  Query Processing In A Relational Database Management System , 1979, Fifth International Conference on Very Large Data Bases, 1979..

[62]  Nathan Goodman,et al.  On the translation of relational queries into iterative programs , 1989, TODS.

[63]  Goetz Graefe,et al.  Optimization of dynamic query evaluation plans , 1994, SIGMOD '94.