CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly. Contending writes get serialized, creating a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem; at best, they help to serialize the contending writes efficiently. This paper proposes a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called CASPAR, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly scalable synchronization. We evaluate CASPAR with simulations of a 64-core chip. Compared to existing proposals with hardware queues, CASPAR improves the throughput of kernels by 32% on average, and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.
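To make the contention problem concrete, the following is a minimal software sketch of the baseline pattern the abstract refers to: several threads updating one shared variable through a CAS retry loop. It is not CASPAR, which is a hardware architecture; it only illustrates why contending CASes serialize, since at most one CAS on a given expected value can succeed and the rest must retry. The names `cas_increment`, `run_contended_increments`, and `ContentionResult` are hypothetical and chosen for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this illustration (not from the paper).
struct ContentionResult {
    std::uint64_t final_value;  // value of the shared counter after all threads finish
    std::uint64_t failed_cas;   // CAS attempts that failed and had to be retried
};

// Lock-free increment built on a CAS retry loop.
// Returns how many CAS attempts failed before one succeeded.
inline std::uint64_t cas_increment(std::atomic<std::uint64_t>& counter) {
    std::uint64_t failures = 0;
    std::uint64_t expected = counter.load(std::memory_order_relaxed);
    // The CAS stores expected + 1 only if the counter still holds `expected`.
    // On failure, `expected` is refreshed with the current value and we retry.
    // Note: compare_exchange_weak may also fail spuriously, so `failures`
    // is an upper bound on failures caused by real contention.
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        ++failures;
    }
    return failures;
}

// Runs `num_threads` threads that each perform `iters_per_thread`
// lock-free increments on the same shared counter.
inline ContentionResult run_contended_increments(unsigned num_threads,
                                                 std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> failed{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &failed, iters_per_thread] {
            std::uint64_t local_failures = 0;
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                local_failures += cas_increment(counter);
            }
            failed.fetch_add(local_failures, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    // No increment may be lost: every successful CAS adds exactly one.
    assert(counter.load() == static_cast<std::uint64_t>(num_threads) * iters_per_thread);
    return {counter.load(), failed.load()};
}
```

Calling `run_contended_increments(8, 100000)` on a multicore machine would typically report a nonzero `failed_cas` count that grows with the number of threads, while `final_value` always equals the total number of increments. Those retries are the wasted work that serialization causes in software; the abstract's proposal targets the same effect at the hardware-queue level.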
