Paper Information - Fully Read/Write Fence-Free Work-Stealing with Multiplicity

Fully Read/Write Fence-Free Work-Stealing with Multiplicity

Work-stealing is a popular technique to implement dynamic load balancing in a distributed manner. In this approach, each process owns a set of tasks that have to be executed. The owner of the set can put tasks in it and can take tasks from it to execute them. When a process runs out of tasks, instead of being idle, it becomes a thief and steals tasks from a victim. Thus, a work-stealing algorithm provides three high-level operations: Put and Take, which can be invoked only by the owner, and Steal, which can be invoked by a thief. One of the main goals when designing work-stealing algorithms is to make Put and Take as simple and efficient as possible. Unfortunately, it has been shown that any work-stealing algorithm in the standard asynchronous model must use expensive Read-After-Write synchronization patterns or atomic Read-Modify-Write instructions (e.g. Compare&Swap) [...] however, Put uses fences among Write instructions, and Steal uses Compare&Swap and fences among Read instructions. This paper considers work-stealing with multiplicity, a relaxation in which every task is taken by at least one operation, with the requirement that any process can extract a task at most once. Three versions of the relaxation are considered, and fully Read/Write algorithms are presented in the standard asynchronous model, all of them devoid of Read-After-Write synchronization patterns; the last algorithm is also fully fence-free.
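To make the relaxation concrete, the following is a minimal sequential sketch in Python of a Put/Take/Steal interface in which two processes may extract the same task, while no single process extracts a task twice. It is a toy model and not one of the paper's algorithms: the class `ToyMultiplicityQueue`, the helper methods `read_head` and `finish`, the FIFO order, and the per-process cursor are all assumptions introduced for illustration. Each extraction is split into two steps so that a delayed thief can be simulated by hand; the sketch says nothing about fences or memory models.

```python
from typing import Any, Dict, Hashable, List, Optional


class ToyMultiplicityQueue:
    """Toy FIFO model (hypothetical): `head` is only ever read or written, never CAS'd."""

    def __init__(self, owner: Hashable) -> None:
        self.owner = owner
        self.tasks: List[Any] = []             # appended to only by the owner
        self.head = 0                          # shared index of the next task
        self.local: Dict[Hashable, int] = {}   # private cursor of each process

    def put(self, pid: Hashable, task: Any) -> None:
        assert pid == self.owner, "Put may be invoked only by the owner"
        self.tasks.append(task)

    def read_head(self, pid: Hashable) -> int:
        """Step 1 of an extraction: read the shared index, never going
        below this process's own cursor."""
        return max(self.head, self.local.get(pid, 0))

    def finish(self, pid: Hashable, idx: int) -> Optional[Any]:
        """Step 2: return tasks[idx] and publish idx + 1 with a plain write."""
        if idx >= len(self.tasks):
            return None                        # nothing left for this process
        self.local[pid] = idx + 1              # this process never revisits idx
        self.head = idx + 1
        return self.tasks[idx]

    def take(self, pid: Hashable) -> Optional[Any]:
        assert pid == self.owner, "Take may be invoked only by the owner"
        return self.finish(pid, self.read_head(pid))

    def steal(self, pid: Hashable) -> Optional[Any]:
        return self.finish(pid, self.read_head(pid))


q = ToyMultiplicityQueue(owner="owner")
q.put("owner", "a")
q.put("owner", "b")

idx = q.read_head("thief")          # the thief reads head = 0, then is delayed
by_owner = q.take("owner")          # meanwhile the owner extracts "a"
by_thief = q.finish("thief", idx)   # the thief resumes and also extracts "a"
next_by_thief = q.steal("thief")    # the thief does not see "a" again
```

In this interleaving the task `"a"` is extracted by both the owner and the thief, which is the kind of duplication the relaxation tolerates, while the thief's private cursor keeps it from extracting `"a"` a second time.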

Armando Castañeda | Miguel Pina
