CounterMiner: Mining Big Performance Data from Hardware Counters

Modern processors typically provide a small number of hardware performance counters to capture a large number of microarchitecture events. In cloud computing platforms with thousands of servers and millions of complex workloads running in a "24/7/365" manner, these counters can easily generate a huge amount of data (e.g., GB or TB per day), which we call big performance data. Big performance data provides a valuable foundation for root cause analysis of performance bottlenecks, architecture and compiler optimization, and more. However, extracting value from it is challenging for two reasons: 1) it contains many errors that are hard to perceive (e.g., outliers and missing values); and 2) it is difficult to obtain insights from it, e.g., to relate events to performance. In this paper, we propose CounterMiner, a rigorous methodology that enables the measurement and understanding of big performance data by using data mining and machine learning techniques. It includes three novel components: 1) using data cleaning to improve data quality by replacing outliers and filling in missing values; 2) iteratively quantifying, ranking, and pruning events based on their importance with respect to performance; and 3) quantifying the interaction intensity between two events by residual variance. We use sixteen benchmarks (eight from CloudSuite and eight from the Spark version of HiBench) to evaluate CounterMiner. The experimental results show that CounterMiner reduces the average error from 28.3% to 7.7% when multiplexing 10 events on 4 hardware counters. We also conduct a real-world case study, showing that identifying important configuration parameters of Spark programs by event importance is much faster than directly ranking the importance of these parameters.
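As a rough illustration of the data cleaning component (replacing outliers and filling in missing values), the following minimal Python sketch cleans the sampled time series of a single event. The abstract does not specify the detection or filling rules, so everything here is an assumption: the function name `clean_series`, the median/MAD outlier rule with threshold `k`, and linear interpolation between neighboring valid samples are illustrative choices, not necessarily the techniques CounterMiner uses.

```python
import statistics


def clean_series(samples, k=3.0):
    """Clean one event's sampled counts (hypothetical helper, not from the paper).

    samples: list of numbers, with None marking a missing sample.
    k: assumed outlier threshold, in robust standard deviations.
    Returns a list of floats with outliers replaced and gaps filled.
    """
    present = [x for x in samples if x is not None]
    if not present:
        raise ValueError("series contains no valid samples")

    # Robust center and spread: median and median absolute deviation (MAD).
    med = statistics.median(present)
    mad = statistics.median([abs(x - med) for x in present])
    limit = k * 1.4826 * mad  # 1.4826 scales MAD to a standard deviation

    # Treat outliers like missing values so both are filled the same way.
    # If MAD is zero the spread is degenerate, so no sample is flagged.
    marked = [
        None if x is None or (mad > 0 and abs(x - med) > limit) else x
        for x in samples
    ]
    valid = [i for i, x in enumerate(marked) if x is not None]

    cleaned = []
    for i, x in enumerate(marked):
        if x is not None:
            cleaned.append(float(x))
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            # Gap at the start: copy the first valid sample.
            cleaned.append(float(marked[right]))
        elif right is None:
            # Gap at the end: copy the last valid sample.
            cleaned.append(float(marked[left]))
        else:
            # Interior gap: interpolate linearly between valid neighbors.
            w = (i - left) / (right - left)
            cleaned.append(marked[left] + w * (marked[right] - marked[left]))
    return cleaned


if __name__ == "__main__":
    raw = [100, 102, None, 98, 5000, 101, 99]
    print(clean_series(raw))
```

In this made-up series, the missing third sample and the spike of 5000 are both replaced by values interpolated from their neighbors. A production pipeline would likely need per-event thresholds and a filling method suited to bursty counter data.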
