Efficient bulk insertion into a distributed ordered table

We study the problem of bulk-inserting records into tables in a system that horizontally range-partitions data over a large cluster of shared-nothing machines. Each table partition contains a contiguous portion of the table's key range, and must accept all records inserted into that range. Examples of such systems include Google's BigTable [8] and Yahoo!'s PNUTS [15]. During bulk inserts into an existing table, if most of the inserted records fall into a small number of data partitions, throughput may be very poor because the cluster's parallelism is used ineffectively. We propose a novel approach in which a planning phase is invoked before the actual insertions. By creating new partitions and intelligently distributing partitions across machines, the planning phase ensures that the insertion load will be well balanced. Since there is a tradeoff between the cost of moving partitions and the resulting throughput gain, the planning phase must minimize the sum of partition movement time and insertion time. We show that this problem is a variant of the NP-hard bin-packing problem, reduce it to a vector-packing problem, and then give a solution with provable approximation guarantees. We evaluate our approach on a prototype system deployed on a cluster of 50 machines, and show that it yields significant improvements over more naïve techniques.
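To make the tradeoff concrete, the following is a minimal sketch of the objective described above: partition movement time plus insertion time, where insertion time is set by the most heavily loaded machine. It is not the paper's vector-packing algorithm. The rebalancing step is a simple longest-processing-time greedy heuristic used as a stand-in, it does not create new partitions, and all names, rates, and the serial-transfer model for moves are assumptions made for illustration.

```python
from collections import defaultdict

def insertion_time(assignment, insert_counts, insert_rate):
    """Time for the busiest machine to absorb its share of the inserts."""
    load = defaultdict(int)
    for part, machine in assignment.items():
        load[machine] += insert_counts[part]
    return max(load.values()) / insert_rate

def movement_time(before, after, sizes, move_rate):
    """Assumed model: relocated partitions are transferred one after another."""
    moved = sum(sizes[p] for p in after if before[p] != after[p])
    return moved / move_rate

def total_cost(before, after, insert_counts, sizes, insert_rate, move_rate):
    """Objective from the abstract: movement time plus insertion time."""
    return (movement_time(before, after, sizes, move_rate)
            + insertion_time(after, insert_counts, insert_rate))

def greedy_rebalance(before, machines, insert_counts):
    """Stand-in heuristic: place the heaviest partitions first, each on the
    least-loaded machine, preferring the current machine on ties."""
    load = {m: 0 for m in machines}
    after = {}
    for part in sorted(before, key=lambda p: insert_counts[p], reverse=True):
        target = min(machines, key=lambda m: (load[m], m != before[part]))
        after[part] = target
        load[target] += insert_counts[part]
    return after

# Hypothetical scenario: all inserted records fall into partitions on m0.
machines = ["m0", "m1"]
before = {"p0": "m0", "p1": "m0", "p2": "m1", "p3": "m1"}
insert_counts = {"p0": 600, "p1": 400, "p2": 0, "p3": 0}  # records per partition
sizes = {"p0": 100, "p1": 100, "p2": 100, "p3": 100}      # existing data per partition
INSERT_RATE = 100  # records per second per machine (assumed)
MOVE_RATE = 50     # size units per second (assumed)

after = greedy_rebalance(before, machines, insert_counts)
cost_without_plan = total_cost(before, before, insert_counts, sizes, INSERT_RATE, MOVE_RATE)
cost_with_plan = total_cost(before, after, insert_counts, sizes, INSERT_RATE, MOVE_RATE)
print(cost_without_plan, cost_with_plan)  # 10.0 8.0
```

In this toy scenario, moving p1 costs 2 seconds but cuts insertion time from 10 to 6 seconds, so the plan pays off. With a slower move rate or larger partitions the same move could cost more than it saves, which is why the planning phase must weigh both terms together.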

[1] Li Fan, et al. Web caching and Zipf-like distributions: evidence and implications, 1999, IEEE INFOCOM '99.

[2] Paolo Toth, et al. Knapsack Problems: Algorithms and Computer Implementations, 1990.

[3] Richard Loulou, et al. New Greedy-Like Heuristics for the Multidimensional 0-1 Knapsack Problem, 1979, Oper. Res.

[4] Bernhard Seeger, et al. A Generic Approach to Bulk Loading Multidimensional Index Structures, 1997, VLDB.

[5] Community Systems Group. Community systems research at Yahoo!, 2007, SIGMOD Rec.

[6] Jin-Kao Hao, et al. A hybrid approach for the 0-1 multidimensional knapsack problem, 2001, IJCAI 2001.

[7] John C. Fisk, et al. An algorithm for 0-1 multiple-knapsack problems, 1978.

[8] Philip A. Bernstein, et al. Data Management Issues in Supporting Large-Scale Web Services, 2006, IEEE Data Eng. Bull.

[9] Jin-Kao Hao, et al. A Hybrid Approach for the 0-1 Multidimensional Knapsack Problem, 2001, IJCAI.

[10] Goetz Graefe, et al. B-tree indexes for high update rates, 2006, SIGMOD Rec.

[11] Carlos Maltzahn, et al. Ceph: a scalable, high-performance distributed file system, 2006, OSDI '06.

[12] Hans-Arno Jacobsen, et al. PNUTS: Yahoo!'s hosted data serving platform, 2008, Proc. VLDB Endow.

[13] C. Mohan, et al. Algorithms for creating indexes for very large tables without quiescing updates, 1992, SIGMOD '92.

[14] S. Sudarshan, et al. Incremental Organization for Data Recording and Warehousing, 1997, VLDB.

[15] Jeffrey F. Naughton, et al. Sampling Issues in Parallel Database Systems, 1992, EDBT.

[16] Jeffrey F. Naughton, et al. OODB Bulk Loading Revisited: The Partitioned-List Approach, 1995, VLDB.

[17] David J. DeWitt, et al. Parallel sorting on a shared-nothing architecture using probabilistic splitting, 1991, Proceedings of the First International Conference on Parallel and Distributed Information Systems.

[18] S. Senju, et al. An Approach to Linear Programming with 0-1 Variables, 1968.

[19] Werner Vogels, et al. Dynamo: Amazon's highly available key-value store, 2007, SOSP.

[20] Sanjeev Khanna, et al. A Polynomial Time Approximation Scheme for the Multiple Knapsack Problem, 2005, SIAM J. Comput.

[21] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.

[22] Patrick E. O'Neil, et al. The log-structured merge-tree (LSM-tree), 1996, Acta Informatica.

[23] Marcos K. Aguilera, et al. Sinfonia: a new paradigm for building scalable distributed systems, 2007, SOSP.

[24] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.