Thermal benchmarking and modeling for HPC using big data applications

Abstract: Characterizing thermal profiles of cluster nodes is an integral part of any approach that addresses thermal emergencies in a data center. Most existing thermal models use CPU utilization to estimate power consumption, which in turn is used to predict outlet temperatures. Such utilization-based thermal models may introduce errors due to inaccurate mappings from system utilization to outlet temperatures. To address this concern, we eliminate the utilization model as a middleman from the thermal model. In this paper, we propose a thermal model, tModel, that projects outlet temperatures from inlet temperatures and directly measured multicore temperatures rather than from a utilization model. In the first phase of this work, we perform extensive experimentation by varying application types, their input data sizes, and cluster sizes. Simultaneously, we collect inlet, outlet, and multicore temperatures of cluster nodes running these diverse big data applications. The proposed thermal model estimates the outlet air temperature of the nodes to predict cooling costs. We validate the accuracy of our model against data gathered by thermal sensors in our cluster. Our results demonstrate that tModel estimates outlet temperatures of the cluster nodes with much higher accuracy than CPU-utilization-based models. We further show that tModel is well suited to estimating the cooling cost of data centers using the predicted outlet temperatures.
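To make the idea of predicting outlet temperature directly from measured temperatures concrete, the following is a minimal sketch and not the tModel formulation itself, which the abstract does not specify. It assumes, purely for illustration, a linear relation between outlet temperature, inlet temperature, and the mean core temperature, fitted by ordinary least squares. The function names (`fit_outlet_model`, `predict_outlet`, `solve_linear`) and all sample values are hypothetical; the data are synthetic, not measurements from the paper.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_outlet_model(inlet, core_temps, outlet):
    """Fit outlet ~ w0 + w1*inlet + w2*mean(core temps) via normal equations.

    ASSUMPTION: the linear form is illustrative only; the source does not
    state the functional form used by tModel.
    """
    rows = [[1.0, t_in, sum(cores) / len(cores)]
            for t_in, cores in zip(inlet, core_temps)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outlet)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict_outlet(coef, t_in, cores):
    """Predict one node's outlet temperature from the fitted coefficients."""
    return coef[0] + coef[1] * t_in + coef[2] * sum(cores) / len(cores)


# Synthetic, noise-free samples (hypothetical values, degrees Celsius).
rng = random.Random(0)
inlet = [rng.uniform(18.0, 27.0) for _ in range(50)]
core_temps = [[rng.uniform(40.0, 80.0) for _ in range(4)] for _ in range(50)]
outlet = [2.0 + 0.8 * t + 0.15 * sum(c) / len(c)
          for t, c in zip(inlet, core_temps)]

coef = fit_outlet_model(inlet, core_temps, outlet)
print("fitted coefficients:", coef)
print("predicted outlet:", predict_outlet(coef, 22.0, [60.0, 60.0, 60.0, 60.0]))
```

Because the synthetic outlet values are generated from the same linear form, the fit recovers the generating coefficients; with real sensor traces, the residual error of such a fit is what would be compared against a utilization-based model.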
