Minimizing Thermal Variation in Heterogeneous HPC Systems with FPGA Nodes

FPGAs are increasingly deployed in data centers because of their strong performance as accelerators. Thermal management, and in particular containing cooling cost, is a primary concern in these high-performance systems, and introducing new heterogeneous components adds further complexity to thermal modeling and management. The thermal behavior of multi-FPGA systems deployed within large compute clusters remains largely unexplored. In this paper, we first show that FPGAs of the same generation can exhibit different thermal behavior even when running the same tasks, owing to their physical location in a rack and to process variation. We then present a machine-learning-based model that captures the thermal behavior of a multi-node FPGA cluster. Finally, we propose to mitigate thermal variation and hotspots across the cluster through proactive task placement guided by this thermal model. Our experiments show that proper placement of tasks on the multi-FPGA system can reduce the peak temperature by up to 11.50°C with no impact on performance.
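The idea of model-guided placement can be illustrated with a minimal sketch. It is not the paper's method: the additive predictor `predict_temp`, the node names, the idle temperatures, and the per-task heat values are all hypothetical stand-ins for a learned thermal model and measured data. The sketch assumes one task per FPGA node and pairs the hottest task with the coolest node, which minimizes the predicted peak under this simple additive assumption.

```python
from typing import Dict


def predict_temp(idle_temp_c: float, task_heat_c: float) -> float:
    """Hypothetical predictor: node idle temperature plus task heating.

    A real system would replace this with a trained model; the additive
    form is an assumption made only for illustration.
    """
    return idle_temp_c + task_heat_c


def place_tasks(task_heat: Dict[str, float],
                node_idle: Dict[str, float]) -> Dict[str, str]:
    """Assign one task per node, hottest task to coolest node."""
    if len(task_heat) > len(node_idle):
        raise ValueError("more tasks than nodes")
    # Tasks ordered from most to least heat generated.
    tasks = sorted(task_heat, key=lambda t: task_heat[t], reverse=True)
    # Nodes ordered from coolest to warmest idle temperature.
    nodes = sorted(node_idle, key=lambda n: node_idle[n])
    return dict(zip(tasks, nodes))


def peak_temperature(placement: Dict[str, str],
                     task_heat: Dict[str, float],
                     node_idle: Dict[str, float]) -> float:
    """Highest predicted node temperature under a given placement."""
    return max(predict_temp(node_idle[node], task_heat[task])
               for task, node in placement.items())


if __name__ == "__main__":
    # Hypothetical values: idle temperatures differ by rack position.
    node_idle = {"fpga0": 40.0, "fpga1": 45.0, "fpga2": 50.0}
    task_heat = {"a": 5.0, "b": 15.0, "c": 10.0}

    naive = {"a": "fpga0", "b": "fpga1", "c": "fpga2"}
    guided = place_tasks(task_heat, node_idle)

    print("naive peak :", peak_temperature(naive, task_heat, node_idle))
    print("guided peak:", peak_temperature(guided, task_heat, node_idle))
```

Because every task still runs on exactly one node, a placement change of this kind alters only where heat is generated, not how much work is done.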
