CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures

Multi-socket multi-core architectures, in which the cores of each socket share a cache, have become mainstream in high-performance computing because a single multi-core chip cannot provide enough computing capacity. However, traditional task-stealing schedulers choose their victims at random, which tends to pollute the shared caches and incur many cache misses. To address this problem, this paper proposes a Cache Aware Task-Stealing (CATS) scheduler, which uses an online profiling method to schedule tasks that share data onto the same socket. Based on the profiling information, CATS applies an online DAG partitioner so that tasks with shared data can use the shared cache efficiently. Notably, CATS does not require any extra user-provided information. Experimental results show that CATS improves the performance of memory-bound programs by up to 74.4% compared with a traditional task-stealing scheduler.
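
The idea of placing tasks that share data on the same socket can be illustrated with a minimal sketch. This is not the CATS algorithm itself: the abstract does not specify the profiler or the DAG partitioner, so the function `assign_tasks_to_sockets`, the `task_data` mapping (task name to the set of data blocks it touches), and the greedy least-loaded placement are all assumptions made for illustration. A real scheduler would obtain the sharing information from online profiling rather than from an explicit table.

```python
from collections import defaultdict


def assign_tasks_to_sockets(task_data, num_sockets):
    """Group tasks that share data blocks, then place each group on one socket.

    task_data: hypothetical mapping of task name -> set of data block ids.
    Returns a mapping of task name -> socket index.
    """
    parent = {task: task for task in task_data}

    def find(task):
        # Union-find lookup with path halving.
        while parent[task] != task:
            parent[task] = parent[parent[task]]
            task = parent[task]
        return task

    # Merge tasks that touch the same data block into one group.
    first_user = {}
    for task, blocks in task_data.items():
        for block in blocks:
            if block in first_user:
                parent[find(task)] = find(first_user[block])
            else:
                first_user[block] = task

    groups = defaultdict(list)
    for task in task_data:
        groups[find(task)].append(task)

    # Place the largest groups first, each on the least-loaded socket,
    # so that every group stays within a single shared cache.
    load = [0] * num_sockets
    placement = {}
    for members in sorted(groups.values(), key=len, reverse=True):
        socket = load.index(min(load))
        for task in members:
            placement[task] = socket
        load[socket] += len(members)
    return placement


if __name__ == "__main__":
    tasks = {"a": {1, 2}, "b": {2, 3}, "c": {9}, "d": {3}, "e": {9, 10}}
    print(assign_tasks_to_sockets(tasks, num_sockets=2))
```

Because sharing is treated as transitive here, a chain of overlapping tasks collapses into one group, which can unbalance the load; a practical partitioner would need to trade cache locality against load balance.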
