Big Data Generation

Big data challenges are end-to-end problems. Big data typically has to be preprocessed, moved, loaded, processed, and stored many times, which has led to the creation of big data pipelines. Current big data benchmarks focus only on isolated aspects of this pipeline, usually the processing, storage, and loading stages. To date, no benchmark has been presented that covers the end-to-end aspect of big data systems. In this paper, we discuss the necessity of ETL-like tasks in big data benchmarking and propose the Parallel Data Generation Framework (PDGF) for their data generation. PDGF is a generic data generator that was implemented at the University of Passau and is currently adopted in TPC benchmarks.
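To illustrate the kind of generation PDGF enables, the following is a minimal sketch of deterministic, seed-based parallel data generation. It only demonstrates the general principle that each field value is computed independently from its position, so workers can generate arbitrary subsets of rows in parallel and still produce identical data; the class and method names are illustrative and do not reflect PDGF's actual API.

```java
import java.util.Random;
import java.util.stream.IntStream;

// Minimal sketch of seed-based, repeatable parallel data generation.
// Names are illustrative only, not PDGF's API.
public class SeededGeneratorSketch {

    // Derive a deterministic seed for a given (table, column, row) position.
    static long seedFor(long tableSeed, int column, long row) {
        long h = tableSeed;
        h = h * 31 + column;
        h = h * 31 + row;
        return h;
    }

    // Compute one field value purely from its position, so any worker can
    // regenerate any row independently and all workers agree on the data.
    static int fieldValue(long tableSeed, int column, long row) {
        Random rng = new Random(seedFor(tableSeed, column, row));
        return rng.nextInt(1_000_000);
    }

    public static void main(String[] args) {
        long tableSeed = 42L;
        // Rows can be generated in parallel; the output is still deterministic.
        IntStream.range(0, 10).parallel().forEach(row ->
            System.out.println(row + "," + fieldValue(tableSeed, 0, row)));
    }
}
```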
