Availability modeling and analysis of a data center for disaster tolerance

Availability assessment of a data center with disaster tolerance (DT) is a demanding task for cloud computing based businesses. Previous work modeled and analyzed such computing systems without adequately considering disaster occurrence, unexpected failures of network connections, and the dependencies between subsystems in a data center. This paper presents a comprehensive availability model of a data center for DT using stochastic reward nets (SRN). The model incorporates (i) a typical two-level high availability (HA) configuration (i.e., active/active between sites and active/passive within a site); (ii) various fault and disaster tolerance techniques; (iii) dependencies between subsystems (e.g., between a host and virtual machines (VMs), and between a network-attached storage (NAS) and VMs) as well as dependencies between disastrous events and physical subsystems; and (iv) unexpected failures during data transmission between data centers. The constructed SRN model is evaluated through steady-state analysis, downtime cost analysis, and sensitivity analysis with respect to the most influential parameters. The results show the availability improvement of the disaster tolerant data center (DTDC) and characterize how the system responds to changes in the selected variables. The modeling and analysis of the DTDC in this paper provide a basis for choosing among disaster tolerant designs, taking into account the trade-off between system availability, downtime cost, and infrastructure construction cost.
We present comprehensive availability modeling and analysis of a data center system for disaster tolerance.
We assess the availability characteristics of a data center with regard to disaster occurrence, unexpected failures of network connections, and complicated dependencies.
The study shows the significance of incorporating disaster and fault tolerance techniques into geographically distributed data centers for high availability of cloud based businesses.
The study provides a basis for choosing among disaster tolerant designs, considering the trade-off between system availability, downtime cost, and infrastructure construction cost.
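
The trade-off between availability and downtime cost mentioned above can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's SRN model: it assumes a simple two-state (up/down) site with exponential failure and repair, and it treats the two active/active sites as failing independently, which is exactly the simplification the paper's dependency modeling is meant to avoid. All function names and numeric values are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a two-state (up/down) component."""
    return mttf_hours / (mttf_hours + mttr_hours)


def active_active_availability(site_availability: float, sites: int = 2) -> float:
    """Availability when service is up if at least one site is up.

    Assumes sites fail independently (an illustrative simplification).
    """
    return 1.0 - (1.0 - site_availability) ** sites


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly downtime cost for a given steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * cost_per_hour


# Hypothetical inputs: one site with MTTF = 999 h and MTTR = 1 h,
# and a downtime cost of 1000 currency units per hour.
single_site = steady_state_availability(mttf_hours=999.0, mttr_hours=1.0)
two_sites = active_active_availability(single_site)

print(f"single site availability: {single_site:.6f}")
print(f"two-site availability:    {two_sites:.6f}")
print(f"single site yearly cost:  {annual_downtime_cost(single_site, 1000.0):.2f}")
print(f"two-site yearly cost:     {annual_downtime_cost(two_sites, 1000.0):.2f}")
```

Comparing the yearly downtime cost of the two configurations against the cost of building the second site gives a rough version of the selection basis the study describes; a dependency-aware model such as an SRN would generally yield less optimistic availability figures than this independence assumption.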
