Thermal aware automated load balancing for HPC applications

As we move towards the exascale era, power and energy have become major challenges. Some supercomputers draw more than 10 megawatts, leading to high energy bills, and a significant portion of this energy is spent on cooling. In this paper, we propose an adaptive control system that minimizes cooling energy by using Dynamic Voltage and Frequency Scaling (DVFS) to control the temperature and by performing load balancing. This framework, which is part of the adaptive runtime system, monitors system and application characteristics and triggers mechanisms to limit the temperature. It also performs load balancing whenever imbalance is detected and load balancing is expected to be beneficial. We demonstrate, using a set of applications and benchmarks, that the proposed framework can control the temperature of the cores effectively and reduce the timing penalty automatically, without any support from the user.
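The two decisions described above, throttling a core when it runs hot and rebalancing only when it pays off, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the frequency levels, the 50 °C threshold, the hysteresis margin, the imbalance tolerance, and the names `Core`, `adjust_frequency`, and `should_rebalance` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical DVFS frequency levels in GHz, lowest to highest (assumed values).
FREQ_LEVELS = [1.2, 1.6, 2.0, 2.4]


@dataclass
class Core:
    temp_c: float  # latest temperature reading in degrees Celsius
    level: int     # index into FREQ_LEVELS
    load_s: float  # measured compute time per iteration on this core, in seconds


def adjust_frequency(core, max_temp_c=50.0, margin_c=2.0):
    """Step the core one frequency level down if it is too hot, or one level
    up if it is comfortably below the threshold. Returns the new frequency."""
    if core.temp_c > max_temp_c and core.level > 0:
        core.level -= 1
    elif core.temp_c < max_temp_c - margin_c and core.level < len(FREQ_LEVELS) - 1:
        core.level += 1
    return FREQ_LEVELS[core.level]


def should_rebalance(cores, remaining_iters, lb_cost_s, tolerance=0.05):
    """Return True when the load is imbalanced and the estimated time saved
    over the remaining iterations exceeds the cost of load balancing."""
    loads = [c.load_s for c in cores]
    avg = sum(loads) / len(loads)
    worst = max(loads)
    if avg == 0 or worst / avg - 1.0 <= tolerance:
        return False  # balanced within tolerance, nothing to gain
    # Assumed gain model: a perfect rebalance brings the slowest core to the average.
    expected_gain_s = (worst - avg) * remaining_iters
    return expected_gain_s > lb_cost_s


cores = [Core(temp_c=55.0, level=3, load_s=1.5), Core(temp_c=40.0, level=1, load_s=1.0)]
new_freqs = [adjust_frequency(c) for c in cores]
rebalance = should_rebalance(cores, remaining_iters=100, lb_cost_s=5.0)
```

In this sketch the hot core steps down one level while the cool core steps up, which leaves cores running at different speeds; that is the kind of imbalance the cost-benefit check in `should_rebalance` is meant to catch.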
