Benchmarking big data systems: A survey

Abstract: With the enormous growth in the availability and usage of Big Data storage and processing systems, it has become essential to assess the various performance aspects of these systems so that we can carefully understand their strengths and weaknesses. In practice, when an individual or enterprise aims to develop a Big Data storage and processing solution for harnessing the knowledge inside their data, they are confronted with several available frameworks from which they need to select. This is a challenging task that needs to be guided by good knowledge of the various aspects of such systems. Additionally, the choice normally varies from one scenario to another according to the essential needs of the application. In practice, there is no single benchmark study that covers the different types of big data processing requirements, systems, application scenarios and metrics. Several benchmarks and benchmarking studies have been developed, where each study focuses on some representative type of framework and considers only some aspects. In this article, we provide a comprehensive survey and analysis of the state of the art in benchmarking the different types of big data systems (e.g., NoSQL databases, Big SQL engines, Big Streaming engines, Big Graph Processing engines, Big Machine/Deep Learning engines). Additionally, we highlight some of the significant open challenges and missing requirements of current benchmarks of big data systems, with suggestions of directions for future extensions and improvements.
