ACES and Cray Collaborate on Advanced Power Management for Trinity

The motivation for power and energy measurement and control capabilities in High Performance Computing (HPC) systems is now well accepted by the community. While technology providers have begun to deliver some capabilities in this area, the interfaces that expose these features are vendor specific. The need for a standard way to leverage these emerging capabilities, now and in the future, is clear. To address this need, the Department of Energy funded an effort to produce a Power application programming interface (API) specification for HPC systems, with the goal of contributing this API to the community as a proposed standard for power measurement and control. In addition to the open publication of this standard, an Advanced Power Management Non-Recurring Engineering (NRE) project has been initiated with Cray Inc. with the intention of advancing capabilities in this area and delivering them on a leadership-class platform. We detail the collaboration established between the Alliance for Computing at Extreme Scale (ACES; Sandia National Laboratories and Los Alamos National Laboratory) and Cray, and the portions of the Power API that have been selected for the first production implementation of the standard.

Keywords: power monitoring; power control; energy efficiency; power measurement
