Analyzing Resource Trade-offs in Hardware Overprovisioned Supercomputers

Hardware overprovisioned systems have recently been proposed as a power-efficient design for next-generation supercomputers. A key challenge for such systems is determining the degree of overprovisioning, that is, the number of extra nodes to install under a given power constraint. In this paper, we first show that the degree of overprovisioning depends on dynamic parameters, such as the job mix and the global power constraint, and that static decisions can limit system throughput. We then study an exhaustive set of combinations of adaptive resource management strategies, spanning three job scheduling algorithms, four power capping techniques, and three node boot-up mechanisms, to understand the trade-off space involved. Finally, we draw conclusions about how these strategies can adaptively control the degree of overprovisioning and analyze their impact on job throughput and power utilization.
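
To make the notion of "degree of overprovisioning" concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: every node draws the same peak power, a uniform per-node power cap is applied, and all wattages are illustrative. The function names and numbers are hypothetical. It shows only that the number of extra nodes that fit under a fixed budget changes with the cap, which is one reason a static choice may be a poor fit.

```python
def worst_case_nodes(power_budget_w: int, node_peak_w: int) -> int:
    """Nodes that can all run at peak power within the budget."""
    return power_budget_w // node_peak_w


def overprovisioning_degree(power_budget_w: int, node_peak_w: int,
                            node_cap_w: int) -> int:
    """Extra nodes that fit when every node is capped at node_cap_w.

    Assumes a uniform cap across nodes (a simplification for illustration).
    """
    if not 0 < node_cap_w <= node_peak_w:
        raise ValueError("cap must be positive and no greater than peak power")
    capped_nodes = power_budget_w // node_cap_w
    return capped_nodes - worst_case_nodes(power_budget_w, node_peak_w)


if __name__ == "__main__":
    budget_w = 1_000_000   # hypothetical 1 MW system-wide power constraint
    peak_w = 300           # hypothetical per-node peak power
    for cap_w in (300, 250, 200):
        extra = overprovisioning_degree(budget_w, peak_w, cap_w)
        print(f"cap={cap_w} W -> {extra} extra nodes")
```

With these illustrative numbers, a cap equal to peak power leaves no room for extra nodes, while tighter caps admit progressively more. Whether those extra nodes improve throughput depends on how the job mix responds to capping, which this sketch does not model.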
