Detecting Data Center Cooling Problems Using a Data-driven Approach

Cooling problems are common in data centers, and many of them are hard to detect, especially hidden ones. These problems affect overall system dependability, performance, and power efficiency. We propose a novel method to detect cooling problems. Using common monitoring data available in most data centers, such as environmental temperature and hardware status, we build a workload-independent cooling profile for each server. With the cooling profiles, we are able to detect two types of cooling failures: transient and lasting. We detect transient failures by comparing the observed temperature with the model prediction, and we detect lasting failures by comparing the cooling profiles of different servers. We demonstrate the general applicability of our detection methods in three production data centers with vastly different scales, server types, and workloads, and detect several real cooling problems that had been hidden for months.
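The two detection modes described above can be illustrated with a minimal sketch. The abstract does not specify the form of the cooling profile, so this example assumes a simple linear relation between ambient temperature and server temperature as a stand-in. The function names, the residual tolerance, and the median-based outlier rule are all hypothetical choices for illustration, not the authors' method.

```python
from statistics import median

def fit_profile(ambient, observed):
    """Least-squares fit of observed ~ slope * ambient + offset.

    Assumption: a linear model stands in for the paper's cooling profile.
    """
    n = len(ambient)
    mean_x = sum(ambient) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in ambient)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ambient, observed))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def transient_anomalies(profile, ambient, observed, tolerance=3.0):
    """Indices where the observed temperature exceeds the prediction
    by more than `tolerance` degrees (hypothetical threshold)."""
    slope, offset = profile
    return [i for i, (x, y) in enumerate(zip(ambient, observed))
            if y - (slope * x + offset) > tolerance]

def lasting_outliers(profiles, k=3.0):
    """Servers whose profile offset sits far above the fleet median,
    measured in units of the median absolute deviation (hypothetical rule)."""
    offsets = {name: p[1] for name, p in profiles.items()}
    med = median(offsets.values())
    mad = median(abs(v - med) for v in offsets.values()) or 1e-9
    return [name for name, v in offsets.items() if (v - med) / mad > k]
```

In this sketch a transient failure shows up as a short run of large positive residuals against a server's own profile, while a lasting failure shows up as a profile that is persistently hotter than those of its peers, even though each individual reading matches that server's own (already degraded) model.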
