Spatio-temporal thermal-aware scheduling for homogeneous high-performance computing datacenters

Datacenters have become an important part of today's computing infrastructure. Recent studies have shown the increasing importance of thermal considerations in achieving effective resource management. In this paper, we study thermal-aware scheduling for homogeneous high-performance computing (HPC) datacenters under a thermal model that captures both spatial and temporal correlations of the temperature evolution. We propose an online scheduling heuristic to minimize the makespan for a set of HPC applications subject to a thermal constraint. The heuristic leverages the novel notion of thermal-aware load to perform both job assignment and thermal management. To respect the temperature constraint, which is governed by a complex spatio-temporal thermal correlation, dynamic voltage and frequency scaling (DVFS) is used to regulate job execution at runtime, while the loads of the servers are dynamically balanced to improve the makespan. Extensive simulations are conducted based on an experimentally validated datacenter configuration and realistic parameter settings. The results show improved performance of the proposed heuristic compared to existing solutions in the literature, and demonstrate the importance of both spatial and temporal considerations. In contrast to some scheduling problems, where DVFS introduces performance-energy tradeoffs, our findings reveal that applying DVFS yields both performance and energy gains in the context of spatio-temporal thermal-aware scheduling.

Thermal model capturing both spatial and temporal temperature correlations in datacenters.
Formulation of a spatio-temporal thermal-aware scheduling problem for HPC applications.
Scheduling heuristic using thermal-aware load for job assignment and thermal management.
Simulations to show the effectiveness of the heuristic under a wide range of parameters.
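To make the idea of a spatio-temporal thermal constraint enforced through DVFS more concrete, the following is a minimal sketch and not the paper's model or heuristic. It assumes a simple discrete-time update in which each server's temperature blends its previous value (temporal part) with a steady-state value that depends on every server's power through a recirculation matrix `D` (spatial part). All constants, the cubic power model, and the names `next_temps` and `pick_speeds` are hypothetical choices made for illustration.

```python
# Toy spatio-temporal thermal model. Every constant and name here is an
# illustrative assumption, not a value or API taken from the paper.
T_SUPPLY = 20.0                      # cold-air supply temperature (deg C)
ALPHA = 0.8                          # temporal factor: weight of previous temperature
R = 0.5                              # thermal resistance (deg C per watt)
P_IDLE, P_DYN = 50.0, 100.0          # idle power and full-speed dynamic power (W)
SPEEDS = (1.0, 0.8, 0.6, 0.4, 0.0)   # DVFS levels, fastest first; 0.0 means idle


def power(speed):
    """Server power draw under an assumed cubic dynamic-power model."""
    return P_IDLE + P_DYN * speed ** 3


def next_temps(temps, speeds, D):
    """Advance all server temperatures by one time step.

    Spatial part: the inlet temperature of server i rises with the power of
    each server j, weighted by the recirculation coefficient D[i][j].
    Temporal part: the new temperature blends the old one with the
    steady-state value that the current power would produce.
    """
    powers = [power(s) for s in speeds]
    result = []
    for i, t_old in enumerate(temps):
        inlet = T_SUPPLY + sum(D[i][j] * powers[j] for j in range(len(temps)))
        steady = inlet + R * powers[i]
        result.append(ALPHA * t_old + (1 - ALPHA) * steady)
    return result


def pick_speeds(temps, D, t_max):
    """Greedily throttle one server at a time until the predicted next-step
    temperatures all respect t_max, or until every server is idle."""
    levels = [0] * len(temps)        # index into SPEEDS for each server
    while True:
        speeds = [SPEEDS[k] for k in levels]
        predicted = next_temps(temps, speeds, D)
        if all(t <= t_max for t in predicted):
            return speeds
        # Servers that can still be slowed down by one DVFS level.
        candidates = [i for i in range(len(temps)) if levels[i] < len(SPEEDS) - 1]
        if not candidates:
            return speeds            # nothing left to throttle
        hottest = max(candidates, key=lambda i: predicted[i])
        levels[hottest] += 1


if __name__ == "__main__":
    D = [[0.0, 0.01], [0.01, 0.0]]   # hypothetical recirculation coefficients
    print(pick_speeds([40.0, 40.0], D, t_max=45.0))
```

The greedy throttling loop is only one plausible way to pick speeds. A scheduler in the spirit of the abstract would also balance load across servers when assigning jobs, which this sketch does not attempt.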
