Just can't get enough: Synthesizing Big Data

With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. While only a few years ago only successful transactions were recorded in sales systems, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking. Vendors have few options to showcase the performance of their systems other than trivial data sets like those used by TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be used, simplistic proofs of concept have to be used instead, leaving both customers and vendors uncertain about performance in the target use case. As a solution, we present an automatic approach to data synthesis from existing data sources. Our system enables fully automatic generation of large amounts of complex, realistic, synthetic data.
