Environmental Conditions and Disk Reliability in Free-cooled Datacenters

Free cooling lowers datacenter costs significantly, but may also expose servers to higher and more variable temperatures and relative humidities. It is currently unclear whether these environmental conditions have a significant impact on hardware component reliability. In this paper, we use data from nine hyperscale datacenters to study the impact of environmental conditions on the reliability of server hardware, with a particular focus on disk drives and free cooling. Based on this study, we derive and validate a new model of disk lifetime as a function of environmental conditions. We also quantify the tradeoffs between energy consumption, environmental conditions, component reliability, and datacenter costs. Finally, based on our analyses and model, we derive server and datacenter design lessons. Our observations include: (1) relative humidity seems to have a dominant impact on component failures; (2) disk failures increase significantly when operating at high relative humidity, due to controller/adaptor malfunction; and (3) though higher relative humidity increases component failures, software availability techniques can mask them and enable free-cooled operation, resulting in significantly lower infrastructure and energy costs that far outweigh the cost of the extra component failures.
