Identifying Dwarfs Workloads in Big Data Analytics

Big data benchmarking is particularly important and provides applicable yardsticks for evaluating booming big data systems. However, wide coverage and great complexity of big data computing impose big challenges on big data benchmarking. How can we construct a benchmark suite using a minimum set of units of computation to represent diversity of big data analytics workloads? Big data dwarfs are abstractions of extracting frequently appearing operations in big data computing. One dwarf represents one unit of computation, and big data workloads are decomposed into one or more dwarfs. Furthermore, dwarfs workloads rather than vast real workloads are more cost-efficient and representative to evaluate big data systems. In this paper, we extensively investigate six most important or emerging application domains i.e. search engine, social network, e-commerce, multimedia, bioinformatics and astronomy. After analyzing forty representative algorithms, we single out eight dwarfs workloads in big data analytics other than OLAP, which are linear algebra, sampling, logic operations, transform operations, set operations, graph operations, statistic operations and sort.

[1]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[2]  Philip S. Yu,et al.  Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.

[3]  Edward Y. Chang,et al.  Pfp: parallel fp-growth for query recommendation , 2008, RecSys '08.

[4]  Yuqing Zhu,et al.  BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework , 2014, DASFAA.

[5]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[6]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[7]  Yuqing Zhu,et al.  BigDataBench: A big data benchmark suite from internet services , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[8]  Margo I. Seltzer,et al.  The case for application-specific benchmarking , 1999, Proceedings of the Seventh Workshop on Hot Topics in Operating Systems.

[9]  Olfa Nasraoui,et al.  Mining search engine query logs for query recommendation , 2006, WWW '06.

[10]  Qi He,et al.  Web Query Recommendation via Sequential Query Prediction , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[11]  Joseph M. Hellerstein,et al.  GraphLab: A New Framework For Parallel Machine Learning , 2010, UAI.

[12]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[13]  Timothy G. Armstrong,et al.  LinkBench: a database benchmark based on the Facebook social graph , 2013, SIGMOD '13.

[14]  Daisuke Takahashi,et al.  The HPC Challenge (HPCC) benchmark suite , 2006, SC.

[15]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[16]  Yanpei Chen,et al.  From TPC-C to Big Data Benchmarks: A Functional Workload Model , 2012, WBDB.

[17]  Trevor Mudge,et al.  Data dwarfs: Motivating a coverage set for future large data center workloads , 2010 .

[18]  E. F. CODD,et al.  A relational model of data for large shared data banks , 1970, CACM.

[19]  Babak Falsafi,et al.  Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware , 2011 .

[20]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.