Paper information - CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly: contending writes are serialized, which creates a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem; at best, they help to serialize the contending writes efficiently. This paper proposes a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called CASPAR, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly scalable synchronization. We evaluate CASPAR with simulations of a 64-core chip. Compared to existing proposals with hardware queues, CASPAR improves the throughput of kernels by 32% on average, and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.
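To make the serialization problem concrete, the following is a minimal software sketch of the kind of CAS retry loop the abstract refers to. It is not taken from the paper and does not model CASPAR's hardware. The names `RunResult` and `run_contended_increments` are hypothetical, chosen only for illustration. Several threads increment one shared counter; a CAS succeeds only if the thread's expected value is still current, so under contention the other threads fail and retry, and the successful updates effectively happen one at a time.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Outcome of one contended run (hypothetical type, for illustration only).
struct RunResult {
    std::uint64_t final_value;  // value of the shared counter after all threads finish
    std::uint64_t failed_cas;   // CAS attempts that failed and had to be retried
};

// Hypothetical helper: each of `num_threads` threads performs `increments`
// lock-free increments on a single shared variable using a CAS retry loop.
RunResult run_contended_increments(unsigned num_threads, std::uint64_t increments) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> failures{0};

    auto worker = [&]() {
        std::uint64_t local_failures = 0;
        for (std::uint64_t i = 0; i < increments; ++i) {
            std::uint64_t expected = counter.load(std::memory_order_relaxed);
            // On failure, compare_exchange_weak reloads `expected` with the
            // current value, so the loop retries with fresh data. The weak
            // form may also fail spuriously; those failures are counted too.
            while (!counter.compare_exchange_weak(expected, expected + 1,
                                                  std::memory_order_acq_rel,
                                                  std::memory_order_relaxed)) {
                ++local_failures;
            }
        }
        failures.fetch_add(local_failures, std::memory_order_relaxed);
    };

    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker);
    }
    for (auto& th : threads) {
        th.join();
    }

    return {counter.load(), failures.load()};
}
```

Calling `run_contended_increments(4, 10000)` always yields a `final_value` of 40000, because every increment eventually succeeds. The `failed_cas` count varies from run to run and tends to grow with the number of contending threads, which is the wasted work the abstract describes.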
