Introducing Application Awareness Into a Unified Power Management Stack

Effective power management in a data center is critical to ensure that power delivery constraints are met while maximizing the performance of users' workloads. Power limiting is needed in order to respond to greater-than-expected power demand. HPC sites have generally tackled this by adopting one of two approaches: (1) a system-level power management approach that is aware of the facility or site-level power requirements, but is agnostic to application demands; or (2) a job-level power management solution that is aware of application design patterns and requirements, but is agnostic to site-level power constraints. Simultaneously incorporating solutions from both domains often leads to conflicts between power management mechanisms. This, in turn, affects system stability and leads to irreproducible performance. To avoid this irreproducibility, HPC sites have to choose between the two approaches, thereby missing opportunities for efficiency gains. This paper demonstrates the need for the HPC community to collaborate towards seamless integration of system-aware and application-aware power management approaches. This is achieved by proposing a new dynamic policy that inherits the benefits of both approaches through tight integration of a resource manager and a performance-aware job runtime environment. An empirical comparison of this integrated management approach against state-of-the-art solutions exposes the benefits of investing in end-to-end solutions to optimize for system-wide performance or efficiency objectives. With our proposed system–application integrated policy, we observed up to a 7% reduction in system time dedicated to jobs and up to 11% savings in compute energy, compared to a baseline that is agnostic to system power and application design constraints.
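The dynamic policy sketched above reallocates a fixed site-level power budget across jobs according to application behavior. A minimal toy sketch of one such reallocation step is shown below; the function name, the per-job `min_w`/`sensitivity` fields, and the proportional-sharing heuristic are all illustrative assumptions, not the paper's actual algorithm:

```python
def reallocate_power(total_budget_w, jobs):
    """Split a site-level power budget across jobs in proportion to each
    job's power sensitivity (performance gained per extra watt), while
    respecting a per-job minimum power floor.

    jobs: list of dicts with keys 'min_w' (power floor in watts) and
          'sensitivity' (performance gain per watt, >= 0).
    Returns a list of per-job power caps in watts.
    """
    # First, guarantee every job its minimum power floor.
    floors = [j["min_w"] for j in jobs]
    spare = total_budget_w - sum(floors)
    if spare < 0:
        raise ValueError("budget is below the sum of job power floors")

    # Then steer the remaining watts toward the most power-sensitive jobs.
    total_sens = sum(j["sensitivity"] for j in jobs)
    if total_sens == 0:
        # No job benefits from extra power: split the spare evenly.
        shares = [spare / len(jobs)] * len(jobs)
    else:
        shares = [spare * j["sensitivity"] / total_sens for j in jobs]

    return [f + s for f, s in zip(floors, shares)]
```

In a real deployment, the resource manager would supply `total_budget_w`, the runtime would estimate each job's sensitivity from measured performance counters, and the resulting caps would be enforced via a hardware mechanism such as RAPL; this sketch only captures the budget-splitting step.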
