WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures

Asymmetric Multi-Core (AMC) architectures have shown high performance as well as power efficiency. However, current parallel programming environments perform poorly on AMC because they assume that all cores are symmetric and provide equal performance. Their random task scheduling policies, such as task-stealing, can result in unbalanced workloads in AMC and severely degrade the performance of parallel applications. To balance the workloads of parallel applications in AMC, this paper proposes a Workload-Aware Task Scheduling (WATS) scheme that adopts history-based task allocation and preference-based task stealing. The history-based task allocation computes a near-optimal, static task allocation from the historical statistics collected during the execution of a parallel application. The preference-based task stealing, which steals tasks according to a preference list, dynamically adjusts the workloads in AMC when the task allocation is suboptimal due to approximation in the history-based task allocation. Experimental results show that WATS can improve the performance of CPU-bound applications by up to 82.7% compared with the random task scheduling policies.
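
The two mechanisms named above can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details the abstract does not give. Everything here is an assumption made for illustration: the function names `allocate_by_history`, `preference_list`, and `steal`, the workload and capacity units, the greedy rule of giving each core group a share of the recorded work proportional to its capacity, and the preference ordering (own group first, then slower groups, then faster ones).

```python
from collections import deque


def allocate_by_history(avg_workload, group_capacity):
    """Statically map task classes to core groups (hypothetical greedy rule).

    avg_workload: dict of task class -> workload recorded in earlier runs.
    group_capacity: aggregate speed of each core group, fastest group first.
    Returns a dict of task class -> index of the assigned core group.
    """
    total_work = sum(avg_workload.values())
    total_cap = sum(group_capacity)
    # Heaviest task classes are considered first, so they land on fast groups.
    classes = sorted(avg_workload, key=avg_workload.get, reverse=True)

    allocation = {}
    group = 0
    assigned = 0.0
    target = total_work * group_capacity[0] / total_cap
    for cls in classes:
        # Advance once the current group has received its proportional share.
        while group < len(group_capacity) - 1 and assigned >= target:
            group += 1
            assigned = 0.0
            target = total_work * group_capacity[group] / total_cap
        allocation[cls] = group
        assigned += avg_workload[cls]
    return allocation


def preference_list(own_group, num_groups):
    """Order in which an idle core looks for work (assumed ordering)."""
    slower = list(range(own_group + 1, num_groups))
    faster = list(range(own_group - 1, -1, -1))
    return [own_group] + slower + faster


def steal(own_group, queues):
    """Take one task from the first non-empty queue in the preference list."""
    for g in preference_list(own_group, len(queues)):
        if queues[g]:
            return queues[g].popleft()
    return None


if __name__ == "__main__":
    history = {"a": 60, "b": 25, "c": 10, "d": 5}
    print(allocate_by_history(history, [3.0, 1.0]))
    print(steal(0, [deque(), deque(["c", "d"])]))
```

In this sketch the static allocation only approximates a balanced split, because whole task classes are assigned to a single group. The stealing step is what corrects any imbalance left over at run time.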
