The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems

We live in the era of big data, and big data applications are becoming increasingly pervasive. How to benchmark data center computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impact of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, we obtain two major results through experiments. First, the data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as the data scale increases; hence, benchmarks for big data systems must consider not only a variety of data sets but also a variety of applications.
