BigBench: towards an industry standard benchmark for big data analytics

There is tremendous interest in big data from academia, industry, and a large user base. Several commercial and open-source providers have released a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare their performance. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and a synthetic data generator that address the variety, velocity, and volume aspects of big data systems containing structured, semi-structured, and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark and is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured part captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three dimensions: data sources, query processing types, and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200-gigabyte BigBench data set, executing the BigBench queries (written using Teradata Aster SQL-MR), and reporting their response times.
