ShenZhen transportation system (SZTS): a novel big data benchmark suite

Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose ShenZhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprised of real-life transportation analysis applications with real-life input data sets from Shenzhen in China. SZTS uniquely focuses on a specific and real-life application domain whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at the microarchitecture level, the operating system (OS) level, and the job level, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and we propose a methodology for identifying representative input data sets.

[1]  Eyal Oren,et al.  Sindice.com: a document-oriented lookup index for open linked data , 2008, Int. J. Metadata Semant. Ontologies.

[2]  Archana Ganapathi,et al.  The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[3]  An-Pin Chen,et al.  An Integrated Home Financial Investment Learning Environment Applying Cloud Computing in Social Network Analysis , 2011, 2011 International Conference on Advances in Social Networks Analysis and Mining.

[4]  Tilmann Rabl,et al.  Benchmarking Big Data Systems and the BigData Top100 List , 2013, Big Data.

[5]  Lieven Eeckhout,et al.  Quantifying the Impact of Input Data Sets on Program Behavior and its Applications , 2003, J. Instr. Level Parallelism.

[6]  Luo Lie Heuristic Artificial Intelligent Algorithm for Genetic Algorithm , 2010 .

[7]  Hamid R. Arabnia,et al.  A Parallel Algorithm for the Arbitrary Rotation of Digitized Images Using Process-and-Data-Decomposition Approach , 1990, J. Parallel Distributed Comput..

[8]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[9]  Dhabaleswar K. Panda,et al.  High performance RDMA-based design of HDFS over InfiniBand , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[11]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[12]  Lieven Eeckhout,et al.  Microarchitecture-Independent Workload Characterization , 2007, IEEE Micro.

[13]  Avi Mendelson,et al.  Deep-dive analysis of the data analytics workload in CloudSuite , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).

[14]  Timothy G. Armstrong,et al.  LinkBench: a database benchmark based on the Facebook social graph , 2013, SIGMOD '13.

[15]  Yuqing Zhu,et al.  BigDataBench: A big data benchmark suite from internet services , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[16]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[17]  Andrey Balmin,et al.  Jaql , 2011, Proc. VLDB Endow..

[18]  Chengyang Zhang,et al.  Map-matching for low-sampling-rate GPS trajectories , 2009, GIS.

[19]  Hamid R. Arabnia,et al.  Parallel Edge-Region-Based Segmentation Algorithm Targeted at Reconfigurable MultiRing Network , 2003, The Journal of Supercomputing.

[20]  Hamid R. Arabnia,et al.  The REFINE Multiprocessor - Theoretical Properties and Algorithms , 1995, Parallel Comput..

[21]  Hamid R. Arabnia,et al.  A distributed stereocorrelation algorithm , 1995, Proceedings of Fourth International Conference on Computer Communications and Networks - IC3N'95.

[22]  Chaitanya K. Baru,et al.  Setting the Direction for Big Data Benchmark Standards , 2012, TPCTC.

[23]  Chunjie Luo,et al.  Characterizing data analysis workloads in data centers , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[24]  Hamid R. Arabnia,et al.  Parallel stereocorrelation on a reconfigurable multi-ring network , 1996, The Journal of Supercomputing.

[25]  Hamid R. Arabnia,et al.  Arbitrary Rotation of Raster Images with SIMD Machine Architectures , 1987, Comput. Graph. Forum.

[26]  Babak Falsafi,et al.  Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware , 2011 .

[27]  Gang Lu,et al.  CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications , 2012, Frontiers of Computer Science.

[28]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[29]  S.M. Bhandarkar,et al.  The Hough Transform on a Reconfigurable Multi-Ring Network , 1995, J. Parallel Distributed Comput..

[30]  Hamid R. Arabnia,et al.  A Reconfigurable Architecture for Image Processing and Computer Vision , 1995, Int. J. Pattern Recognit. Artif. Intell..