Nowa: A Wait-Free Continuation-Stealing Concurrency Platform

Efficiently exploiting the parallelism of today’s multi- and many-core processors remains an ongoing challenge. Scalability becomes more crucial than ever with the rapidly growing number of processing elements in many-core systems that operate in data centres and embedded domains. Scalability is often ensured by using fully-strict fork/join concurrency, the prevalent approach of concurrency platforms like Cilk. The runtime systems employed by those platforms typically resort to lock-based synchronisation due to the complex interactions of data structures within the runtime. However, locking severely limits scalability. With the availability of commercial off-the-shelf systems with hundreds of logical cores, this is becoming a problem for an increasing number of systems. This paper presents Nowa, a novel wait-free approach to arbitrate the plentiful concurrent strands managed by a concurrency platform’s runtime system. The wait-free approach is enabled by exploiting inherent properties of fully-strict fork/join concurrency, and hence is potentially applicable to every continuation-stealing runtime system of a concurrency platform. We have implemented Nowa and compared it with existing runtime systems, including Cilk Plus and Threading Building Blocks (TBB), which employ a lock-based approach. Our evaluation results show that the wait-free implementation increases performance by up to 1.64× compared to lock-based ones on a system with 256 hardware threads. Performance increased by 1.17× on average, and only one benchmark exhibited a performance regression. Compared against OpenMP tasks using Clang’s libomp, Nowa outperforms OpenMP by 8.68× on average.
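To make the idea of arbitrating strands at a sync point without locks more concrete, the following is a minimal C++ sketch of a join counter built on a single atomic decrement. It is an illustration under stated assumptions, not Nowa's actual algorithm: the names `JoinCounter`, `arrive`, and `run_frame` are hypothetical, and `std::thread` is used only as scaffolding to create concurrent strands. The decrement is wait-free only on hardware that provides a native atomic fetch-and-add; on load-linked/store-conditional architectures it may be merely lock-free.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical join counter for one fork/join frame (illustration only,
// not taken from the Nowa implementation).
struct JoinCounter {
    std::atomic<int> pending;

    explicit JoinCounter(int strands) : pending(strands) {}

    // Each strand calls arrive() exactly once when it reaches the sync point.
    // fetch_sub returns the previous value, so exactly one caller, the last
    // strand to arrive, observes 1 and gets `true`.
    bool arrive() {
        return pending.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

// Runs `children` child strands plus the parent strand and returns how many
// strands were selected to resume the continuation after the sync point.
int run_frame(int children) {
    assert(children >= 0);
    JoinCounter join(children + 1);  // the children plus the parent strand
    std::atomic<int> resumed{0};

    std::vector<std::thread> workers;
    for (int i = 0; i < children; ++i) {
        workers.emplace_back([&join, &resumed] {
            if (join.arrive()) {
                resumed.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    // The parent strand reaches the sync point as well.
    if (join.arrive()) {
        resumed.fetch_add(1, std::memory_order_relaxed);
    }

    for (auto& t : workers) {
        t.join();
    }
    return resumed.load();
}
```

Regardless of how the strands interleave, `run_frame` returns 1: no strand ever waits for a lock, and exactly one strand is selected to continue past the sync point.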
