Big Data Benchmark Compendium

The field of Big Data and related technologies is evolving rapidly. Consequently, many benchmarks are emerging, driven by academia and industry alike. Because these benchmarks emphasize different aspects of Big Data and, in many cases, cover different technical platforms and use cases, it is extremely difficult to keep up with the pace of benchmark creation. In addition, the combination of large data volumes, heterogeneous data formats, and changing processing velocity makes it hard to specify a single architecture that suits all application requirements, which in turn makes the investigation and standardization of such systems very difficult. As a result, the traditional approach of specifying a standardized benchmark with pre-defined workloads, which has been in use for years for transaction and analytical processing systems, is not trivial to apply to Big Data systems. This document summarizes existing benchmarks and those in development, gives a side-by-side comparison of their characteristics, and discusses their pros and cons. The goal is to understand the current state of Big Data benchmarking and to guide practitioners in their approaches and use cases.
