Flexible Database Generators

The evaluation and applicability of many database techniques, ranging from access methods, histograms, and optimization strategies to data normalization and mining, depend crucially on their ability to cope robustly with varying data distributions. However, comprehensive real data is often hard to come by, and there is no flexible data generation framework capable of modelling rich, varying data distributions. This has led individual researchers to develop their own ad hoc data generators for specific tasks. As a consequence, the resulting data distributions and query workloads are often hard to reproduce, analyze, and modify, which prevents their wider use. In this paper we present a flexible, easy-to-use, and scalable framework for database generation. We then discuss how to map several previously proposed synthetic distributions to our framework and report preliminary results.
