Fully Read/Write Fence-Free Work-Stealing with Multiplicity

Work-stealing is a popular technique for implementing dynamic load balancing in a distributed manner. In this approach, each process owns a set of tasks that have to be executed. The owner of the set can put tasks in it and can take tasks from it to execute them. When a process runs out of tasks, instead of being idle, it becomes a thief and steals tasks from a victim. Thus, a work-stealing algorithm provides three high-level operations: Put and Take, which can be invoked only by the owner, and Steal, which can be invoked by a thief. One of the main targets when designing work-stealing algorithms is to make Put and Take as simple and efficient as possible. Unfortunately, it has been shown that any work-stealing algorithm in the standard asynchronous model must use expensive Read-After-Write synchronization patterns or atomic Read-Modify-Write instructions (e.g. Compare&Swap). [...] however, Put uses fences among Write instructions, and Steal uses Compare&Swap and fences among Read instructions. This paper considers work-stealing with multiplicity, a relaxation in which every task is taken by at least one operation, with the requirement that any process can extract a task at most once. Three versions of the relaxation are considered, and fully Read/Write algorithms are presented in the standard asynchronous model, all of them devoid of Read-After-Write synchronization patterns; the last algorithm is also fully fence-free.
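To make the Put/Take/Steal interface and the multiplicity relaxation concrete, the following is a minimal single-threaded sketch. It is a toy model, not any of the paper's algorithms, and it makes no claim about fences or memory models. The class name `MultiplicityQueue`, the `begin`/`finish` split (used only to imitate two operations overlapping in time), the per-process cursor `local`, and the `EMPTY` sentinel are all assumptions introduced for illustration.

```python
EMPTY = object()  # hypothetical sentinel meaning "no task available"


class MultiplicityQueue:
    """Toy sequential model of work-stealing with multiplicity.

    Tasks sit in a FIFO list. A shared `head` hint and a per-process
    cursor decide which index an operation tries to extract.
    """

    def __init__(self):
        self.tasks = []   # tasks in insertion order
        self.head = 0     # shared hint: next index believed to be free
        self.local = {}   # per-process lower bound on the next index

    def put(self, task):
        """Owner only: add a task."""
        self.tasks.append(task)

    def begin(self, pid):
        """Step 1 of an extraction: choose an index using reads only."""
        return max(self.head, self.local.get(pid, 0))

    def finish(self, pid, i):
        """Step 2 of an extraction: return task i and advance the cursors."""
        if i >= len(self.tasks):
            return EMPTY
        self.local[pid] = i + 1             # pid will never pick index i again
        self.head = max(self.head, i + 1)   # advance the shared hint
        return self.tasks[i]

    def take(self):
        """Owner only: extract a task in one uninterrupted step."""
        return self.finish("owner", self.begin("owner"))

    def steal(self, pid):
        """Thief `pid`: extract a task in one uninterrupted step."""
        return self.finish(pid, self.begin(pid))


def demo():
    q = MultiplicityQueue()
    q.put("a")
    q.put("b")
    # Imitate a Take and a Steal that overlap: both choose index 0
    # before either one advances the shared head.
    i_owner = q.begin("owner")
    i_thief = q.begin("t1")
    first = q.finish("owner", i_owner)
    second = q.finish("t1", i_thief)
    third = q.take()
    fourth = q.steal("t1")
    return [first, second, third, fourth is EMPTY]
```

In `demo()`, the overlapping Take and Steal both return task `"a"`, which is the kind of duplication the relaxation permits: the task is extracted by two distinct processes, but never twice by the same one, because each process's cursor in `local` only moves forward. Operations that do not overlap behave like an ordinary queue.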
