Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

The performance of high-end supercomputers will reach the exascale through core counts in the billions. In the upcoming exascale era, however, it is important to focus not only on performance but also on the scalability of fine-grained parallel applications, data locality, and energy-aware scheduling within the parallel code. Parallel applications already need to change: algorithms and data structures must be redesigned to take advantage of recent improvements in the energy efficiency of heterogeneous computing hardware, including multicore processors and GPU accelerators. Over the next few years, one of the biggest challenges for exascale will be the ability of parallel applications to fully exploit locality, which is in turn required to achieve the expected performance and energy efficiency. Future highly parallel applications will have to deal with deep memory hierarchies and account for the energy cost of moving data off-chip. They will therefore need new coordinated scheduling approaches that balance energy-aware resource utilization and minimize work starvation at runtime. As new constraints on memory bandwidth and energy come to play a key role in high performance computing (HPC), more sophisticated and dynamic scheduling techniques will have to be applied within the parallel code.

In this paper we focus on an energy-aware distribution of the stencil workload on heterogeneous processors. Our analysis of energy and performance models focuses on the relevant class of stencil computations and explores the relationship between task scheduling algorithms and energy constraints. More precisely, we search for a schedule that minimizes the energy used by the stencil workload on heterogeneous architectures while meeting a specified computation deadline. Since the problem is computationally intractable, we present an integer linear programming formulation for finding optimal schedules. As finding optimal schedules is time consuming, we have also developed four heuristics and tested them experimentally against the optimal solutions. In our work we focus on single-node configurations with heterogeneous processors. These configurations represent state-of-the-art multi- and many-core architectures.
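To make the optimization target concrete, the following minimal sketch illustrates the kind of problem described above. It is not the paper's integer linear programming formulation or any of its four heuristics. It assumes a deliberately simplified model in which the workload is split into identical stencil blocks and each processor has a fixed time and energy cost per block. The function names `min_energy_split` and `greedy_split` and all numbers are hypothetical, and idle power and data transfer costs are ignored.

```python
from itertools import product


def total_energy(counts, procs):
    """Energy of an assignment: blocks per processor times energy per block."""
    return sum(x * e for x, (_, e) in zip(counts, procs))


def min_energy_split(n_blocks, procs, deadline):
    """Exhaustive search (assumed toy model, not the paper's ILP).

    procs is a list of (time_per_block, energy_per_block) pairs.
    Returns (energy, counts) for the cheapest split that finishes every
    processor's share within the deadline, or None if no split fits.
    """
    # A processor can run at most floor(deadline / time_per_block) blocks.
    caps = [min(n_blocks, int(deadline // t)) for t, _ in procs]
    best = None
    for counts in product(*(range(c + 1) for c in caps)):
        if sum(counts) != n_blocks:
            continue
        energy = total_energy(counts, procs)
        if best is None or energy < best[0]:
            best = (energy, counts)
    return best


def greedy_split(n_blocks, procs, deadline):
    """Simple heuristic: fill the most energy-efficient processors first."""
    order = sorted(range(len(procs)), key=lambda i: procs[i][1])
    counts = [0] * len(procs)
    remaining = n_blocks
    for i in order:
        t, _ = procs[i]
        take = min(remaining, int(deadline // t))
        counts[i] = take
        remaining -= take
    if remaining > 0:
        return None  # the deadline cannot be met
    return (total_energy(counts, procs), tuple(counts))


if __name__ == "__main__":
    # Hypothetical devices: (seconds per block, joules per block).
    procs = [(1.0, 5.0), (2.0, 2.0)]
    print(min_energy_split(6, procs, deadline=8.0))
    print(greedy_split(6, procs, deadline=8.0))
```

In this toy model the greedy rule happens to match the exhaustive search, because the per-block costs are independent of each other. The problem studied in the paper is described as computationally intractable, so heuristics for it cannot be expected to be optimal in general.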
