Efficient Update Data Generation for DBMS Benchmarks

Industry standard benchmarks have proven crucial to the innovation and productivity of the computing industry. They enable a fair and standardized assessment of performance across different vendors, across different system versions from the same vendor, and across different architectures. Good benchmarks are even meant to drive industry and technology forward. Yet even good benchmarks become obsolete over time, once all reasonable advances attainable with a particular benchmark have been made. This is why standards consortia periodically overhaul their existing benchmarks or develop new ones. An extremely time- and resource-consuming task in the creation of new benchmarks is the development of benchmark generators, especially as benchmarks tend to become more and more complex. The first version of the Parallel Data Generation Framework (PDGF), a generic data generator, was capable of generating data for the initial load of arbitrary relational schemas. It could not, however, generate data for the actual workload, i.e. input data for transactions (insert, delete, and update), incremental loads, etc., mainly because it had no notion of updates. Updates are data changes that occur over time, e.g. a customer changes address, switches jobs, gets married, or has children. Many benchmarks need to reflect such changes in their workloads. In this paper we present PDGF Version 2, which contains extensions enabling the generation of update data.
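To make the notion of repeatable update generation concrete, the following minimal sketch derives every field value deterministically from a seed built out of (table, row, version). This is only an illustration of the general technique of stateless, seed-based data generation; the function names, the seeding scheme, and the sample values are assumptions for this example, not PDGF's actual implementation.

```python
import random

# Illustrative value domain for one field (e.g. a customer's city).
CITIES = ["Berlin", "Passau", "Toronto", "Munich"]

def field_value(table_seed: int, row_id: int, version: int, field_id: int) -> str:
    """Deterministically generate one field of one row version.

    Because the value depends only on the seed tuple, any version of any
    row can be regenerated independently and in parallel, without storing
    earlier state -- the key property needed to generate update data.
    """
    # Tuples of ints hash deterministically, giving a stable per-value seed.
    seed = hash((table_seed, row_id, version, field_id)) & 0xFFFFFFFF
    rng = random.Random(seed)
    return CITIES[rng.randrange(len(CITIES))]

# The initial load is version 0; an update to the same row is version 1.
initial = field_value(42, row_id=7, version=0, field_id=3)
updated = field_value(42, row_id=7, version=1, field_id=3)

# Regenerating the same (row, version) always yields the same value,
# so an update stream is repeatable across benchmark runs.
assert initial == field_value(42, row_id=7, version=0, field_id=3)
assert updated == field_value(42, row_id=7, version=1, field_id=3)
```

Because no generator state is carried between rows or versions, update data for different time steps can be produced out of order and on independent workers, which is what makes this style of generation attractive for parallel benchmark data generators.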
