Data placement in Bubba

This paper examines the problem of data placement in Bubba, a highly-parallel system for data-intensive applications being developed at MCC. “Highly-parallel” implies that load balancing is a critical performance issue. “Data-intensive” means that the data is so large that operations should be executed where the data resides. As a result, data placement becomes a critical performance issue. In general, determining the placement of data across processing nodes that gives optimal performance is a difficult problem. We describe our heuristic approach to solving the data placement problem in Bubba. We then present experimental results using a specific workload to provide insight into the problem. Several researchers have argued the benefits of declustering (i.e., spreading each base relation over many nodes). We show that as declustering is increased, load balancing continues to improve. However, for transactions involving complex joins, further declustering reduces throughput because of communication, startup, and termination overhead. We argue that data placement, especially declustering, in a highly-parallel system must be considered early in the design, so that mechanisms can be included for supporting variable declustering, for minimizing the most significant overheads associated with large-scale declustering, and for gathering the required statistics.
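The tension between the gains from declustering and its overheads can be illustrated with a toy cost model. The sketch below is not the model or heuristic used in Bubba; it assumes, purely for illustration, that a transaction's work divides evenly over the nodes holding the relation and that each participating node adds a fixed communication, startup, and termination cost. The names `response_time`, `best_degree`, `work`, and `per_node_overhead` are hypothetical.

```python
def response_time(work, per_node_overhead, degree):
    """Toy cost of one transaction at a given degree of declustering.

    Assumption: work splits evenly across `degree` nodes, while every
    participating node adds a fixed overhead (communication, startup,
    termination). Real workloads are skewed and less regular.
    """
    return work / degree + per_node_overhead * degree


def best_degree(work, per_node_overhead, max_nodes):
    """Return the degree in 1..max_nodes with the lowest toy cost."""
    return min(
        range(1, max_nodes + 1),
        key=lambda d: response_time(work, per_node_overhead, d),
    )


if __name__ == "__main__":
    # Cost first falls as the relation is spread out, then rises again
    # once per-node overhead outweighs the extra parallelism.
    for d in (1, 5, 10, 20, 32):
        print(d, round(response_time(100.0, 1.0, d), 2))
    print("best degree:", best_degree(100.0, 1.0, 32))
```

Under these assumptions the cost is lowest near the square root of `work / per_node_overhead`, so a transaction with a larger per-node overhead (for example, a complex join) favors a smaller degree of declustering than a simple one.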
