Adaptive Cache Aware Bitier Work-Stealing in Multisocket Multicore Architectures

Modern multicore computers often adopt a multisocket multicore architecture with shared caches in each socket. However, traditional work-stealing schedulers steal tasks at random, which tends to pollute the shared cache and increase cache misses. To mitigate this problem, this paper proposes an Adaptive Cache-Aware Bi-tier work-stealing (A-CAB) scheduler. A-CAB improves the performance of memory-bound applications by reducing the memory footprint and cache misses of tasks running inside the same CPU socket. A-CAB adaptively uses a DAG partitioner to divide an execution Directed Acyclic Graph (DAG) into an intersocket tier and an intrasocket tier. Tasks in the intersocket tier are scheduled across sockets, while tasks in the intrasocket tier are scheduled within the same socket. Experimental results show that A-CAB can improve the performance of memory-bound applications by up to 74.4 percent compared with traditional work-stealing.
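The bi-tier idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not describe how the DAG partitioner chooses the tier boundary, so the `boundary_level` depth cutoff, the `Task` and `Worker` classes, and the `tier_of`, `push`, and `steal` helpers are all hypothetical names introduced here. Victims are also scanned in a fixed order for readability, whereas work-stealing schedulers normally pick victims at random.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    level: int  # depth of the task in the execution DAG (root = 0)


@dataclass
class Worker:
    socket: int  # index of the CPU socket this worker runs on
    inter: deque = field(default_factory=deque)  # intersocket-tier tasks
    intra: deque = field(default_factory=deque)  # intrasocket-tier tasks


def tier_of(task, boundary_level):
    """Assumed partition rule: tasks at or above the cutoff depth are intersocket."""
    return "inter" if task.level <= boundary_level else "intra"


def push(worker, task, boundary_level):
    """Place a task in the deque that matches its tier."""
    getattr(worker, tier_of(task, boundary_level)).append(task)


def steal(thief, workers):
    """Steal intrasocket tasks only within the thief's socket;
    only intersocket tasks may be taken from any worker."""
    for victim in workers:
        if victim is not thief and victim.socket == thief.socket and victim.intra:
            return victim.intra.popleft()
    for victim in workers:
        if victim is not thief and victim.inter:
            return victim.inter.popleft()
    return None  # nothing eligible to steal
```

In this sketch a worker on another socket can never take an intrasocket task, so tasks that share data stay behind the same shared cache, while the coarser intersocket tasks remain available for balancing load across sockets.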
