Dynamic cache contention detection in multi-threaded applications

In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. Precisely identifying such bottlenecks requires a detailed understanding of a multi-threaded application's behavior. Traditionally, however, such diagnostic information could be obtained only through lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and the parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach (a 5x slowdown on average relative to native execution) is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
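
To make the key insight concrete, the following is a minimal sketch of how sharing could be classified from nothing more than a cache line size and a record of which thread touched which address. It is not the tool described above, which relies on memory shadowing and dynamic instrumentation at run time; the function name `classify_sharing`, the offline trace format `(thread_id, address, is_write)`, and the default 64-byte line size are all assumptions made for illustration, and each access is treated as touching a single byte address.

```python
from collections import defaultdict


def classify_sharing(trace, line_size=64):
    """Classify each cache line in a memory trace (hypothetical helper).

    trace: iterable of (thread_id, address, is_write) tuples.
    Returns a dict mapping cache line index to "true", "false", or "none".
    """
    # Shadow state (illustrative): line -> address -> (reader set, writer set).
    lines = defaultdict(dict)
    for tid, addr, is_write in trace:
        readers, writers = lines[addr // line_size].setdefault(
            addr, (set(), set())
        )
        (writers if is_write else readers).add(tid)

    report = {}
    for line, per_addr in lines.items():
        threads = set()       # every thread that touched this line
        any_write = False     # was any address in the line written?
        true_shared = False   # is some written address used by 2+ threads?
        for readers, writers in per_addr.values():
            touched = readers | writers
            threads |= touched
            any_write = any_write or bool(writers)
            if writers and len(touched) > 1:
                true_shared = True

        if len(threads) < 2 or not any_write:
            # Private or read-only lines cause no coherence traffic.
            report[line] = "none"
        elif true_shared:
            report[line] = "true"
        else:
            # Several threads, at least one write, but no common address.
            report[line] = "false"
    return report


example_trace = [
    (0, 0, True), (1, 8, True),        # different addresses, same line
    (0, 64, True), (1, 64, False),     # same address, one writer
    (0, 128, False), (1, 132, False),  # read-only line
]
result = classify_sharing(example_trace)
```

In this sketch the classification depends only on the trace and `line_size`, not on core binding or cache parameters, which mirrors the observation that the amount of sharing is primarily a function of cache line size and application behavior. It says nothing about the actual slowdown, which does depend on those other factors.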
