Load balancing is a technique that enables efficient parallelization of irregular workloads, and it is a key component of many applications and parallelizing runtimes. Work stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own set of work items and occasionally steals items from the sets of other threads.
The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, a wide class of applications remains correct under relaxed semantics, because either: (i) the application already explicitly checks that no work is repeated, or (ii) the application can tolerate repeated work.
In this paper, we introduce idempotent work stealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once, instead of exactly once.
On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations.
We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work-stealing algorithms: the Cilk THE protocol and the Chase-Lev algorithm. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.
[1] David A. Bader, et al. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs), 2005.
[2] Kenneth Kuhn, et al. Principles of Operation, 1998.
[3] David Chase, et al. Dynamic circular work-stealing deque, 2005, SPAA '05.
[4] Mark Moir, et al. A dynamic-sized nonblocking work stealing deque, 2006, Distributed Computing.
[5] Matteo Frigo, et al. The implementation of the Cilk-5 multithreaded language, 1998, PLDI.
[6] Maged M. Michael. Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS, 2004, DISC.
[7] Nir Shavit, et al. Parallel Garbage Collection for Shared Memory Multiprocessors, 2001, Java Virtual Machine Research and Technology Symposium.
[8] Nir Shavit, et al. Non-blocking steal-half work queues, 2002, PODC '02.
[9] Dennis J. Volper, et al. Geometric retrieval in parallel, 1988.
[10] Bradley C. Kuszmaul, et al. Cilk: an efficient multithreaded runtime system, 1995, PPOPP '95.
[11] Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.
[12] John Greiner, et al. A comparison of parallel algorithms for connected components, 1994, SPAA '94.
[13] David A. Bader, et al. SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs), 1998, J. Parallel Distributed Comput..
[14] C. Greg Plaxton, et al. Thread Scheduling for Multiprogrammed Multiprocessors, 1998, SPAA '98.