Featherlight on-the-fly false-sharing detection

Shared-memory parallel programs routinely suffer from false sharing, a performance degradation that arises when different threads access different variables residing on the same CPU cache line and at least one of those variables is modified. State-of-the-art tools detect false sharing through a heavyweight process: they log memory accesses and feed the resulting access traces to an offline cache simulator. We have developed Feather, a lightweight, on-the-fly false-sharing detection tool. Feather achieves low overhead by exploiting two hardware features ubiquitous in commodity CPUs: performance monitoring units (PMUs) and debug registers. Additionally, Feather is a first-of-its-kind tool that detects false sharing in multi-process applications that use shared memory. Feather allowed us to scale false-sharing detection to myriad codes. It detected several false-sharing cases in important multi-core and multi-process codes, including previous PPoPP artifacts. Eliminating false sharing resulted in speedups of up to 16x.
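The definition of false sharing given above can be written as a small predicate over pairs of memory accesses. The sketch below is only an illustration of that definition, not Feather's PMU and debug-register mechanism. The 64-byte line size, the `Access` record, and the `classify` helper are all assumptions introduced here, and each access is assumed to stay within a single cache line.

```python
from dataclasses import dataclass

# Assumption: 64-byte cache lines, typical of x86-64; the abstract gives no size.
CACHE_LINE_BYTES = 64


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one memory access (not a Feather data structure)."""
    thread: int     # identifier of the accessing thread
    address: int    # first byte touched
    size: int       # number of bytes touched
    is_write: bool  # True for a store, False for a load


def cache_line(address: int) -> int:
    """Index of the cache line that contains the given byte address."""
    return address // CACHE_LINE_BYTES


def classify(a: Access, b: Access) -> str:
    """Classify a pair of accesses as 'false sharing', 'true sharing', or 'none'."""
    if a.thread == b.thread:
        return "none"  # sharing needs two different threads
    if not (a.is_write or b.is_write):
        return "none"  # two reads do not invalidate each other's copy
    if cache_line(a.address) != cache_line(b.address):
        return "none"  # different lines, so no contention on one line
    # Same line, different threads, at least one write: the byte ranges decide.
    overlap = (a.address < b.address + b.size) and (b.address < a.address + a.size)
    return "true sharing" if overlap else "false sharing"


if __name__ == "__main__":
    writer = Access(thread=0, address=0x1000, size=8, is_write=True)
    reader = Access(thread=1, address=0x1008, size=8, is_write=False)
    print(classify(writer, reader))  # prints "false sharing"
```

Here the two accesses touch adjacent 8-byte variables on the same line, so the pair counts as false sharing. Moving the second variable to address `0x1040` (the next line under the 64-byte assumption) would make the same call return `none`, which is the effect that padding-based fixes aim for.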
