A Self-Adjusting Data Distribution Mechanism for Multidimensional Load Balancing in Multiprocessor-Based Database Systems

Abstract: With advances in microprocessor, memory, and communication technology, it has become economically feasible to build parallel database computer systems that improve database performance. Relations in such an environment are usually partitioned and distributed across computing units. To achieve optimal performance, each unit should carry a perfectly balanced load (i.e., hold an identical amount of data). However, fragment sizes drift apart as tuples are inserted into and deleted from a relation. To retain good performance, the system must periodically rebalance the processors' loads by redistributing data among the computing units. Traditionally, this redistribution reshuffles tuples among processors through a relation repartitioning (e.g., rehashing) process, so the computation is performed at the tuple level. In this paper, we present a self-adjusting data distribution scheme that balances processor workload at the cell level (a coarser grain than the tuple) during query processing in order to minimize redistribution cost. The entire scheme is built on top of the popular grid file structure. We discuss the adaptivity of the scheme and its relevant features, and we estimate the cost of load rebalancing. The results show that, under our assumptions, it is always beneficial to rebalance processor workload before performing a join on skewed data.
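The idea of rebalancing at cell granularity rather than tuple granularity can be illustrated with a minimal sketch. This is not the algorithm from the paper: the greedy heuristic, the function name `rebalance_cells`, and the dictionary-based representation of grid cells are all assumptions made for illustration. Each cell is treated as an indivisible unit with a known tuple count, and whole cells are moved from the most loaded processor to the least loaded one while doing so narrows the gap between them.

```python
def rebalance_cells(cell_sizes, assignment, num_procs):
    """Greedy cell-level rebalancing (illustrative only, not the paper's method).

    cell_sizes: {cell_id: tuple_count}   (hypothetical representation of grid cells)
    assignment: {cell_id: processor_index}
    Returns (new_assignment, loads, tuples_moved).
    """
    assignment = dict(assignment)  # work on a copy, leave the input untouched
    loads = [0] * num_procs
    for cell, proc in assignment.items():
        loads[proc] += cell_sizes[cell]

    tuples_moved = 0
    while True:
        heavy = max(range(num_procs), key=loads.__getitem__)
        light = min(range(num_procs), key=loads.__getitem__)
        gap = loads[heavy] - loads[light]

        # Moving a cell of size s from heavy to light changes their gap to
        # |gap - 2s|, which is an improvement only when 0 < s < gap.
        candidates = [
            c for c, p in assignment.items()
            if p == heavy and 0 < cell_sizes[c] < gap
        ]
        if not candidates:
            break

        # Pick the cell that leaves the two processors closest to equal.
        cell = min(candidates, key=lambda c: abs(gap - 2 * cell_sizes[c]))
        assignment[cell] = light
        loads[heavy] -= cell_sizes[cell]
        loads[light] += cell_sizes[cell]
        tuples_moved += cell_sizes[cell]

    return assignment, loads, tuples_moved


if __name__ == "__main__":
    # Skewed starting point: processor 0 holds most of the data.
    sizes = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
    placement = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2}
    new_placement, loads, moved = rebalance_cells(sizes, placement, 3)
    print("loads:", loads, "tuples moved:", moved)
```

Because each step moves a whole cell, the bookkeeping is per cell rather than per tuple. The trade-off is that the final loads may not be perfectly equal when cell sizes do not divide evenly.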
