Spatiotemporal Modeling of Node Temperatures in Supercomputers

ABSTRACT Los Alamos National Laboratory is home to many large supercomputing clusters. These clusters require an enormous amount of power (∼500–2000 kW each), and most of this energy is converted into heat. Cooling the components of a supercomputer is therefore a critical and expensive endeavor. Recently, a project was initiated to investigate the effect that changes to the cooling system in a machine room had on three large machines housed there. Coupled with this goal was the aim of developing a general good practice for characterizing the effect of cooling changes and for monitoring machine node temperatures in this and other machine rooms. This article focuses on the statistical approach used to quantify the effect that several cooling changes to the room had on the temperatures of the individual nodes of the computers. The largest cluster in the room has 1600 nodes that run a variety of jobs during general use. Since extreme temperatures are important, a Normal distribution with a generalized Pareto distribution for the upper tail is used to model the marginal distribution, along with a Gaussian process copula to account for spatiotemporal dependence. A Gaussian Markov random field (GMRF) model is used to model the spatial effects on the node temperatures as the cooling changes take place. This model is then used to assess the condition of the node temperatures after each change to the room. The analysis approach was also used to uncover the cause of a problematic episode of overheating nodes on one of the supercomputing clusters. The same approach can readily be applied to monitor and investigate cooling systems at other data centers. Supplementary materials for this article are available online.
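
The marginal model described above (a Normal body with a generalized Pareto upper tail, mapped to a Gaussian scale for the copula) can be illustrated with a minimal sketch. The abstract does not give the article's exact parameterization, so the splice at a fixed threshold, the function names, and all parameter values below are assumptions chosen for illustration only.

```python
import math
from statistics import NormalDist


def gpd_cdf(y, scale, shape):
    """CDF of a generalized Pareto distribution at an exceedance y."""
    if y <= 0.0:
        return 0.0
    if abs(shape) < 1e-12:  # shape -> 0 reduces to the exponential tail
        return 1.0 - math.exp(-y / scale)
    base = 1.0 + shape * y / scale
    if base <= 0.0:  # past the finite upper endpoint when shape < 0
        return 1.0
    return 1.0 - base ** (-1.0 / shape)


def hybrid_cdf(x, mu, sigma, threshold, scale, shape):
    """Normal CDF below the threshold, rescaled GPD tail above it (assumed splice)."""
    body = NormalDist(mu, sigma)
    p_u = body.cdf(threshold)  # probability mass assigned to the Normal body
    if x <= threshold:
        return body.cdf(x)
    return p_u + (1.0 - p_u) * gpd_cdf(x - threshold, scale, shape)


def to_normal_score(x, **params):
    """Map a temperature to the standard Normal scale used by a Gaussian copula."""
    p = hybrid_cdf(x, **params)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return NormalDist().inv_cdf(p)


# Hypothetical parameter values (degrees Celsius), not taken from the article.
params = dict(mu=60.0, sigma=5.0, threshold=70.0, scale=3.0, shape=0.1)
scores = [to_normal_score(t, **params) for t in (55.0, 60.0, 70.0, 85.0)]
```

The resulting scores are what a Gaussian process copula would treat as jointly Gaussian across nodes and time; the dependence structure itself is not sketched here.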
