Minimizing Thermal Variation Across System Components

Overheating is a serious concern in modern supercomputing systems. Elevated temperatures reduce the reliability and lifetime of the underlying hardware and increase its power consumption. Previous studies on mitigating thermal hotspots at the hardware and run-time system levels have typically traded performance for lower operating temperatures. In this paper, we first show that, in a large-scale system, physical attributes cause an uneven temperature distribution. We then develop a model that characterizes the thermal behaviour of a complex system using various machine learning methods. We propose to improve application placement by incorporating thermal awareness into the decision-making process. Specifically, our system predicts the thermal condition of the system for a given application mapping and uses these predictions to mitigate thermal hotspots without any performance loss. We provide two versions of our prediction mechanism. On a two-node configuration, these models achieve success rates of 72.5% and 78.8%, respectively, where a prediction counts as a success when the resulting scheduling decision yields a task placement with a lower maximum average temperature. Overall, the more aggressive scheme reduces the average peak temperature by up to 11.9°C (2.3°C on average) without any performance degradation.
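
The placement idea described above (predict the temperature each candidate mapping would produce, then pick the mapping with the lowest predicted peak) can be illustrated with a minimal sketch for a two-node configuration. Everything in it is an assumption made for illustration: the node and application names, the numeric values, and in particular `predict_temp`, which is a simple linear stand-in and not the machine-learning model developed in the paper.

```python
from itertools import permutations

# Hypothetical per-node thermal characteristics (illustrative values only):
# idle temperature in deg C and how strongly load heats the node. The two
# nodes differ to mimic an uneven temperature distribution.
NODES = {
    "node0": {"idle_c": 40.0, "heat_per_load": 30.0},
    "node1": {"idle_c": 46.0, "heat_per_load": 34.0},
}

# Hypothetical application load intensities in [0, 1].
APPS = {"appA": 0.9, "appB": 0.4}


def predict_temp(node, load):
    """Stand-in predictor: a linear model, NOT the learned model from the paper."""
    params = NODES[node]
    return params["idle_c"] + params["heat_per_load"] * load


def predicted_peak(mapping):
    """Highest predicted node temperature for a given app -> node mapping."""
    return max(predict_temp(node, APPS[app]) for app, node in mapping.items())


def best_placement(apps, nodes):
    """Enumerate one-app-per-node mappings and keep the lowest predicted peak."""
    candidates = [dict(zip(apps, perm)) for perm in permutations(nodes, len(apps))]
    return min(candidates, key=predicted_peak)


choice = best_placement(list(APPS), list(NODES))
print(choice, predicted_peak(choice))
```

Exhaustive enumeration is only practical for very small configurations such as this one; it is used here to keep the selection step easy to follow, and the only thing that changes between placements is the mapping, not the work each application performs.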
