Characterizing machines lifecycle in Google data centers

Abstract: Due to the increasing need for computational power, the market has shifted towards large centralized data centers. Understanding the dynamics of these data centers from both the machine and the job/task perspectives is critical to designing efficient data center management policies, such as optimal resource and power utilization, capacity planning, and optimal (reactive and proactive) maintenance scheduling. Whereas job/task dynamics have received a lot of attention, the dynamics of the underlying machines that support job/task execution have received much less, even though they can substantially affect job/task performance. Because little data is available from large computing installations, only a few previous studies have examined data centers, and those have focused only on failures and their root causes. In this paper, we study the 2011 Google data center traces from the perspective of machine dynamics. First, we characterize the machine events and their underlying distributions to gain a better understanding of the entire machine lifecycle. Second, we propose a data-driven model to estimate the expected number of available machines at any instant of time. The model is parameterized and validated using the empirical data collected by Google during a one-month period.
