Paper information - MUDD: a multi-dimensional data generator


Today's business intelligence systems consist of hundreds of processors with disk subsystems able to sustain multiple gigabytes of I/O bandwidth. These systems usually contain terabytes of data. Evaluating the performance of database systems at this scale often requires generating synthetic data with well-defined statistical properties. To simulate different scenarios, it is important to be able to vary these statistical properties, including the row counts of tables. Above all, to analyze large-scale systems, data generators must be able to produce hundreds of terabytes of data in a timely fashion. In this paper we present MUDD, a multi-dimensional data generator. Originally designed for TPC-DS, a decision support benchmark being developed by the TPC, MUDD can generate up to 100 terabytes of flat-file data in hours by exploiting modern multiprocessor architectures, including clusters. Its novel design separates data generation algorithms from data distribution definitions, enabling users to adjust their workload to individual needs and different scenarios.
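The separation of generation algorithms from distribution definitions can be illustrated with a minimal sketch. This is not MUDD's implementation: the names `COLOR_DISTRIBUTION`, `make_picker`, and `generate_rows`, the dictionary format, and the per-row seeding scheme are all assumptions made for illustration. The distribution is plain data, the generator is generic code, and each row is seeded independently so that separate processes could produce disjoint row ranges.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Hypothetical distribution definition: values with relative weights.
# It is plain data, so it can be changed without touching the generator.
COLOR_DISTRIBUTION = {
    "values": ["red", "green", "blue"],
    "weights": [5, 3, 2],
}


def make_picker(distribution):
    """Build a function that draws one value according to the weights."""
    values = distribution["values"]
    cumulative = list(accumulate(distribution["weights"]))
    total = cumulative[-1]

    def pick(rng):
        # Draw an integer in [1, total] and locate its weight bucket.
        return values[bisect_left(cumulative, rng.randint(1, total))]

    return pick


def generate_rows(first_row, row_count, distribution, seed=42):
    """Generate rows [first_row, first_row + row_count) as flat-file lines.

    Each row uses its own generator seeded from (seed, row number), so any
    range of rows can be produced independently of the others.
    """
    pick = make_picker(distribution)
    rows = []
    for row_id in range(first_row, first_row + row_count):
        rng = random.Random(seed * 1_000_003 + row_id)
        rows.append(f"{row_id}|{pick(rng)}")
    return rows


if __name__ == "__main__":
    # Two workers producing rows 0-4 and 5-9 yield the same data as
    # one worker producing rows 0-9.
    for line in generate_rows(0, 5, COLOR_DISTRIBUTION):
        print(line)
```

Under these assumptions, changing the workload means editing only the distribution definition, and scaling out means assigning each worker its own row range.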

Meikel Pöss | John M. Stephens
