Resource Availability Prediction in Distributed Systems: An Approach for Modeling Non-Stationary Transition Probabilities

Large-scale distributed systems employ thousands of resources, which inevitably become unavailable at times. Such unavailability can cause serious side effects, such as unexpected delays or failures in application execution. The consequences may be catastrophic for real-time applications or may result in penalties for service providers. Better prediction of resource unavailability helps diminish these undesired outcomes. This paper proposes a resource availability prediction algorithm for that purpose. The variation in resource availability is modeled as a stochastic process. By analyzing the availability information of NDU resources and of both physical and virtual machines of PlanetLab, we found that the transition probabilities among the availability levels are non-stationary. To cope with this characteristic, we introduce Availability Transition Patterns (ATPs); the ATPs are constructed dynamically, and the transitions between them are modeled by a Markov chain. The future ATP is then predicted from the constructed Markov chain, and the resource availability level is predicted according to that ATP. Experimental results confirm the efficiency of the proposed prediction algorithm.
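The prediction step described above, a Markov chain over pattern labels used to predict the next pattern, can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify how ATPs are constructed, so the labels "A", "B", "C", the `history` sequence, and the function names `fit_transition_counts` and `predict_next` are all hypothetical, and a plain first-order chain with a most-frequent-successor rule is assumed.

```python
from collections import Counter, defaultdict


def fit_transition_counts(sequence):
    """Count first-order transitions between consecutive pattern labels."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequently observed successor of `current`.

    Returns None when `current` has never been seen with a successor.
    """
    successors = counts.get(current)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


# Hypothetical pattern labels observed over consecutive time windows
# (assumed data, standing in for dynamically constructed ATPs).
history = ["A", "B", "A", "B", "C", "A", "B", "A", "B"]

counts = fit_transition_counts(history)
next_pattern = predict_next(counts, history[-1])
```

In this toy history the last label is "B", which was followed by "A" twice and by "C" once, so the sketch predicts "A". Mapping a predicted pattern back to an availability level would depend on the ATP definition, which the abstract leaves unspecified.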
