Paper Information - Nowa: A Wait-Free Continuation-Stealing Concurrency Platform

Nowa: A Wait-Free Continuation-Stealing Concurrency Platform

It is an ongoing challenge to efficiently use parallelism with today’s multi- and many-core processors. Scalability becomes more crucial than ever with the rapidly growing number of processing elements in many-core systems that operate in data centres and embedded domains. Scalability is often ensured by using fully-strict fork/join concurrency, which is the prevalent approach used by concurrency platforms like Cilk. The runtime systems employed by those platforms typically resort to lock-based synchronisation due to the complex interactions of data structures within the runtime. However, locking severely limits scalability. With the availability of commercial off-the-shelf systems with hundreds of logical cores, this is becoming a problem for an increasing number of systems. This paper presents Nowa, a novel wait-free approach to arbitrate the plentiful concurrent strands managed by a concurrency platform’s runtime system. The wait-free approach is enabled by exploiting inherent properties of fully-strict fork/join concurrency, and hence is potentially applicable to every continuation-stealing runtime system of a concurrency platform. We have implemented Nowa and compared it with existing runtime systems, including Cilk Plus and Threading Building Blocks (TBB), which employ a lock-based approach. Our evaluation results show that the wait-free implementation increases performance by up to 1.64× compared to lock-based ones on a system with 256 hardware threads. Performance increased by 1.17× on average, and only one benchmark exhibited a performance regression. Compared against OpenMP tasks using Clang’s libomp, Nowa outperforms OpenMP by 8.68× on average.
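As a rough illustration of why fully-strict fork/join lends itself to lock-free arbitration, the sketch below shows a join point decided by a single atomic decrement instead of a lock. It is not taken from the paper and is not Nowa's algorithm; the names `JoinCounter`, `arrive`, and `run_fork_join` are hypothetical. The sketch assumes that every spawned child, plus the parent, arrives at the join exactly once, which is the property fully-strict fork/join provides. Whether `fetch_sub` is truly wait-free depends on the hardware offering a native atomic fetch-and-add.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical join counter: one slot per child plus one for the parent.
struct JoinCounter {
    std::atomic<int> pending;
    explicit JoinCounter(int n) : pending(n) {}

    // Returns true for exactly one caller: the last strand to arrive.
    // A single fetch_sub replaces a lock-protected "am I last?" check.
    bool arrive() {
        return pending.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

// Forks `children` strands and returns how many strands ended up
// running the continuation after the join point.
int run_fork_join(int children) {
    JoinCounter join(children + 1);          // children + parent
    std::atomic<int> continuations_run{0};
    std::vector<int> results(children, 0);
    std::vector<std::thread> workers;

    for (int i = 0; i < children; ++i) {
        workers.emplace_back([&, i] {
            results[i] = i * i;              // the child's work
            if (join.arrive())               // last arrival continues
                continuations_run.fetch_add(1);
        });
    }

    if (join.arrive())                       // the parent arrives too
        continuations_run.fetch_add(1);

    // Thread teardown for the demo only; a real runtime would keep
    // its workers alive and hand the continuation to the last arrival.
    for (auto& t : workers) t.join();
    return continuations_run.load();
}
```

Calling `run_fork_join(8)` should report that exactly one strand ran the continuation, regardless of the order in which the children and the parent reach the join point.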
