Evaluating the Cost of Atomic Operations on Modern Architectures

Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, the performance tradeoffs between these operations and architectural characteristics such as cache structure are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for the latency and bandwidth of different atomics. We consider several state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even though they have different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even when there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used to make better design and algorithmic decisions in parallel programming on architectures deployed in both off-the-shelf machines and large compute systems.
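To make the operations under study concrete, the following C11 sketch shows how FAA and CAS are typically expressed and how a naive per-operation latency estimate could be obtained on an uncontended cache line. This is an illustrative example only, not the paper's benchmark harness; the iteration count, the timing method, and helper names such as elapsed_ns are assumptions.

    #define _POSIX_C_SOURCE 200809L
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Illustrative sketch: single-threaded, uncontended latency estimate for
     * Fetch-and-Add (FAA) and Compare-and-Swap (CAS). Iteration count and
     * timing method are assumptions, not taken from the paper. */

    enum { ITERS = 10 * 1000 * 1000 };

    static _Atomic uint64_t counter;

    static double elapsed_ns(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        struct timespec t0, t1;

        /* FAA: atomic_fetch_add maps to LOCK XADD on x86. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            atomic_fetch_add(&counter, 1);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("FAA: %.1f ns/op\n", elapsed_ns(t0, t1) / ITERS);

        /* CAS: atomic_compare_exchange_strong maps to LOCK CMPXCHG on x86. */
        uint64_t expected = atomic_load(&counter);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            /* Single-threaded, so every CAS succeeds; advance 'expected' to
             * keep tracking the counter (on failure the CAS itself would
             * reload it). Each iteration issues exactly one LOCK CMPXCHG. */
            if (atomic_compare_exchange_strong(&counter, &expected, expected + 1))
                expected = expected + 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("CAS: %.1f ns/op\n", elapsed_ns(t0, t1) / ITERS);

        return 0;
    }

Compiled with, for example, gcc -std=c11 -O2, such a loop exercises the atomic read-modify-write path on a line that stays in the Modified coherence state; measuring other coherence states or remote lines, as the paper does, requires a multi-threaded setup and careful placement of the data.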
