Deep-dive analysis of the data analytics workload in CloudSuite

Exponential growth of digital data has introduced massively parallel systems, special orchestration layers, and new scale-out applications. While recent works suggest that the characteristics of scale-out workloads differ from those of traditional ones, their root causes are not understood. Such understanding is extremely important for improving efficiency; even a 1% performance gain for a core can have a large impact on the datacenter as a whole. This paper studies the characteristics of a Big Data Analytics (BDA) workload on a modern cloud server. It intentionally focuses on a single workload-platform pair to enable a deep-dive analysis that aims to understand the root causes of the CPU bottlenecks that the paper identifies. We choose the Data Analytics benchmark from CloudSuite [1] as a representative of a growing family of important applications. The paper describes a customization of a comprehensive threefold analysis method. The method consists of a System level, where sensitivity to system parameters is examined, as well as Application and Architectural levels, where bottlenecks are attributed back to the application and runtime codes, respectively. The paper also adopts a proof-by-optimization approach to validate the identified bottlenecks. Overall, a 65% net speedup is measured, with a significant power reduction. The paper reveals that BDA workloads suffer from overheads related to managing the data rather than accessing it. For example, hash index lookup is found to be a key performance limiter. Inefficiencies leading to such unexpected behavior are demonstrated, including JVM selection and heavily unoptimized application code, both of which have a large impact. Suboptimal microarchitecture areas are demonstrated as well, in addition to programming styles that limit exploitation of upcoming JVM and CPU parallelization features.

[1]  Lingjia Tang, et al. The impact of memory subsystem resource sharing on datacenter applications, 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[2]  Xiao Zhang, et al. CPI2: CPU performance isolation for shared compute clusters, 2013, EuroSys '13.

[3]  Jie Huang, et al. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis, 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[4]  Ahmad Yasin, et al. A Top-Down method for performance analysis and counters architecture, 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[5]  Maged M. Michael, et al. Experiences Understanding Performance in a Commercial Scale-Out Environment, 2007, Euro-Par.

[6]  Kevin Skadron, et al. PRECISELY PREDICTING PERFORMANCE DEGRADATION DUE TO COLOCATING MULTIPLE EXECUTING APPLICATIONS ON A SINGLE MACHINE IS CRITICAL FOR IMPROVING UTILIZATION IN MODERN, 2012.

[7]  M. Balazinska, et al. An analysis of Hadoop usage in scientific workloads, 2013.

[8]  Babak Falsafi, et al. Meet the walkers: accelerating index traversals for in-memory databases, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Luiz André Barroso, et al. Web Search for a Planet: The Google Cluster Architecture, 2003, IEEE Micro.

[10]  Babak Falsafi, et al. Scale-out processors, 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[11]  Babak Falsafi, et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware, 2012, ASPLOS XVII.

[12]  Luiz André Barroso, et al. The tail at scale, 2013, CACM.

[13]  James Charles, et al. Evaluation of the Intel® Core™ i7 Turbo Boost feature, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[14]  Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, 2006.