Computing Defects per Million in Cloud Caused by Virtual Machine Failures with Replication

Virtual machines (VMs) are used in cloud computing systems to handle user requests for service. A typical user request goes through several processing steps, specific to the cloud service provider, from the instant it is submitted until the service is completed. While the service is being provided, a VM failure can cause the user's request to be dropped. To mitigate the adverse impact of VM failures, a replication mechanism (cold, warm, or hot) can be used. In this paper, we model the system with a structure-state process that characterizes the failure-recovery behavior of a VM in a cloud using one of these replication schemes. We use a service-oriented dependability metric called Defects Per Million (DPM), defined as the number of user requests dropped out of a million. The structure-state process approach is used to derive the job completion time distribution, from which we compute the DPM by counting the requests whose completion time exceeds the specified deadline. The effectiveness of the replication schemes is demonstrated through numerical results.
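As a rough illustration of the DPM definition above, the following sketch estimates DPM by simulation rather than by the analytic structure-state approach used in the paper. Everything in it is an assumption made for illustration: the function names, the exponentially distributed time to VM failure, the fixed recovery delay per failure (intended to be largest for cold and smallest for hot replication), and the preemptive-resume discipline, under which work done before a failure is not lost.

```python
import random

def completion_time(work, failure_rate, recovery_time, rng):
    """Simulate one request (hypothetical model, not the paper's).

    Assumptions: exponential time to VM failure, a fixed recovery delay
    per failure, and preemptive-resume (completed work is preserved).
    """
    elapsed = 0.0
    remaining = work
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= remaining:
            # The request finishes before the next failure occurs.
            return elapsed + remaining
        # A failure interrupts service; pay the recovery delay and resume.
        elapsed += time_to_failure + recovery_time
        remaining -= time_to_failure

def estimate_dpm(work, failure_rate, recovery_time, deadline,
                 n_requests=200_000, seed=0):
    """Estimate DPM as 1e6 times the fraction of requests missing the deadline."""
    rng = random.Random(seed)
    late = sum(
        completion_time(work, failure_rate, recovery_time, rng) > deadline
        for _ in range(n_requests)
    )
    return 1e6 * late / n_requests

if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    for scheme, recovery in [("cold", 1.0), ("warm", 0.4), ("hot", 0.1)]:
        dpm = estimate_dpm(work=1.0, failure_rate=0.01,
                           recovery_time=recovery, deadline=1.5)
        print(f"{scheme:>4} replication: estimated DPM = {dpm:.0f}")
```

Under these assumptions a shorter recovery delay leaves more slack before the deadline, so the estimated DPM should not increase when moving from cold to warm to hot replication. The estimate is a Monte Carlo approximation and carries sampling noise.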
