Cloud dependability analysis: Characterizing Google cluster infrastructure reliability

Cloud computing data centers offer highly available and reliable infrastructures for hosting critical applications and data. These data centers host hundreds of thousands of physical machines that respond to incoming workload by executing jobs. In this paper, we analyze the properties of a Google cloud cluster to investigate the relationship among machine failures, updates, and job failures. We present the statistical properties of Google machine and job failures and attempt to correlate them over a 29-day period. We classify the machine and job failures per day and present a reliability model for Google cluster machines using Continuous-Time Markov Chains.
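To illustrate the kind of reliability model the abstract refers to, the following is a minimal sketch of a two-state (up/down) Continuous-Time Markov Chain for a single machine. It is not the model from the paper: the two-state structure, the function names, and the failure and repair rates are all assumptions chosen for illustration, and no values are taken from the Google trace.

```python
import math

# Hypothetical two-state CTMC: state "up" -> "down" at rate failure_rate,
# state "down" -> "up" at rate repair_rate. Rates are per day and are
# illustrative assumptions, not measurements from the Google cluster trace.


def reliability(failure_rate: float, t: float) -> float:
    """Probability that a machine that is up at time 0 has not failed by time t.

    With an exponentially distributed time to failure, R(t) = exp(-lambda * t).
    """
    return math.exp(-failure_rate * t)


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run fraction of time the machine is up: mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


def availability_at(failure_rate: float, repair_rate: float, t: float) -> float:
    """Probability that the machine is up at time t, given it is up at time 0.

    Closed-form transient solution of the two-state chain:
    A(t) = mu/(lambda+mu) + lambda/(lambda+mu) * exp(-(lambda+mu) * t).
    """
    total = failure_rate + repair_rate
    return repair_rate / total + (failure_rate / total) * math.exp(-total * t)


if __name__ == "__main__":
    lam, mu = 0.02, 2.0  # assumed failures/day and repairs/day
    print(reliability(lam, 29))            # survival over a 29-day window
    print(steady_state_availability(lam, mu))
    print(availability_at(lam, mu, 1.0))
```

In such a sketch, the rates would be estimated from observed per-day failure and recovery counts; a model with more states (for example, a separate state for machines being updated) would need a numerical solution of the chain's generator matrix rather than these closed forms.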
