Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis

Analyzing the system noise that the underlying machines impose on high-throughput systems (e.g., Spark, RDBMSs) must be done at message- or request-level granularity to find the root causes of performance anomalies, because messages pass through many components in very short periods. To this end, we consider it promising to use Precise Event Based Sampling (PEBS), available on Intel CPUs, at higher sampling rates than are normally used. PEBS saves context information (e.g., the general-purpose registers) at occurrences of various hardware events, such as cache misses; this information can be used to associate performance anomalies caused by system noise with specific messages. One challenge is that the overhead of PEBS at high sampling rates has not yet been quantitatively analyzed. This is critical because high sampling rates can cause severe overhead, yet performance problems are often reproducible only in real environments. In this paper, we evaluate the overhead of PEBS and show that: (1) every time PEBS saves context information, the target workload slows down by 200-300 ns due to the CPU overhead of PEBS; (2) this CPU overhead can be used to predict, with high accuracy, the actual overhead incurred on complex workloads, including multi-threaded ones; and (3) PEBS causes cache pollution and extra memory I/O because it writes its records through the CPU cache, and the severity of the pollution depends on both the sampling rate and the size of the buffer allocated for PEBS. To the best of our knowledge, we are the first to quantitatively analyze the overhead of PEBS.
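To make the sampling mechanism concrete, the following is a minimal sketch (not taken from the paper) of how PEBS-based sampling is typically requested on Linux through the perf_event_open system call: setting precise_ip > 0 asks the kernel to use PEBS on Intel CPUs. The event choice (cache misses) and the sampling period below are illustrative assumptions, not the paper's experimental configuration; capturing full register state would additionally require PERF_SAMPLE_REGS_INTR with a sample_regs_intr mask, omitted here for brevity.

/* Minimal sketch: PEBS-backed sampling of hardware cache misses
 * via Linux perf_event_open. Assumptions: Intel CPU with PEBS,
 * Linux kernel with perf_events; period and event are illustrative. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.sample_period = 1000;   /* sample every 1000 misses (high rate) */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr.precise_ip = 2;         /* request precise (PEBS) sampling */
    attr.disabled = 1;           /* start disabled; enable explicitly */
    attr.exclude_kernel = 1;     /* sample user-space execution only */

    /* Measure the calling thread on any CPU, no event group. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload under measurement; samples are delivered
     * through the fd's mmap'd ring buffer (reader omitted here) ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}

Each delivered sample corresponds to one PEBS record written by the CPU, so the per-record 200-300 ns slowdown reported above accumulates in proportion to the sampling rate chosen here.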