Algorithms for the Database Layout Problem

We present a formal analysis of the database layout problem, i.e., the problem of determining how database objects such as tables and indexes are assigned to disk drives. Optimizing this layout has a direct impact on the I/O performance of the entire system. The traditional approach of striping each object across all available disk drives aims to maximize I/O parallelism; however, it is suboptimal when queries co-access two or more database objects, e.g., during a merge join of two tables, because co-access increases the number of random disk seeks. We adopt an existing model that accounts for both the benefit of I/O parallelism and the overhead of random disk accesses, in the context of a query workload that includes co-access of database objects. The resulting optimization problem is intractable in general, so we employ techniques from approximation algorithms to obtain provable performance guarantees. We show that while optimally exploiting I/O parallelism alone suggests uniformly striping data objects (even for heterogeneous files and disks), optimizing random disk access alone would assign each data object to a single disk drive. This confirms the intuition that the two effects are in tension with each other. We provide approximation algorithms that aim to optimize the trade-off between the two effects, and we show that our algorithm achieves the best possible approximation ratio.
