Using Differential Execution Analysis to Identify Thread Interference

Understanding the performance of a multi-threaded application is difficult. Threads interfere with one another when they access the same shared resource, which slows their execution. Current profiling tools report which hardware components or synchronization primitives saturate, but they cannot tell whether that saturation actually causes a performance bottleneck. In this paper, we propose a holistic metric that pinpoints the blocks of code suffering the most interference, regardless of its cause. Our metric uses performance variation as a universal indicator of interference problems. In an evaluation of 27 applications, we show that our metric identifies interference problems, caused by six different kinds of interference, in nine of them. Seven of these bottlenecks are easy to remove, and removing them improves performance by up to nine times.
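To make the idea of performance variation as an interference indicator concrete, here is a minimal sketch. It does not reproduce the paper's metric, which the abstract does not define. It assumes, purely for illustration, that each block of code has been timed several times and that a high coefficient of variation (standard deviation divided by mean) marks a block worth inspecting. The block names, the timings, and the functions `variation_score` and `rank_blocks` are all hypothetical.

```python
from statistics import mean, pstdev


def variation_score(durations):
    """Coefficient of variation of a block's timings (assumed stand-in metric)."""
    avg = mean(durations)
    if avg == 0:
        return 0.0
    return pstdev(durations) / avg


def rank_blocks(samples):
    """Rank blocks from most to least variable; skip blocks with under 2 samples."""
    scores = {
        block: variation_score(durations)
        for block, durations in samples.items()
        if len(durations) >= 2
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical timings in seconds, one list per block of code.
samples = {
    "parse_input": [1.00, 1.02, 0.99, 1.01],
    "update_shared_counter": [0.50, 1.90, 0.60, 2.40],
}

ranking = rank_blocks(samples)
for block, score in ranking:
    print(f"{block}: {score:.2f}")
```

In this toy data the block that touches a shared counter varies far more than the input parser, so it ranks first. A stable block scores near zero whatever its absolute cost, which is why variation, rather than raw duration, is used as the signal here.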
