What does Power Consumption Behavior of HPC Jobs Reveal?: Demystifying, Quantifying, and Predicting Power Consumption Characteristics

As we approach exascale computing, large-scale HPC systems are becoming increasingly power-constrained and must therefore run HPC workloads in an energy-efficient manner. The first step toward this goal is to better understand, analyze, and quantify the power consumption characteristics of HPC jobs. However, the power consumption characteristics of HPC jobs running on production HPC systems remain poorly understood. Such characterization is required to guide the design of the next generation of power-aware resource management. To the best of our knowledge, this is the first study to open-source the data and analysis of power consumption characteristics of HPC jobs and users from two medium-scale production HPC clusters.
