Failover Pattern with a Self-Healing Mechanism for High Availability Cloud Solutions

Cloud computing has been adopted in a broad range of application domains and has become an established building block in IT landscapes. In developing cloud middleware, vendors have focused mainly on the high availability of data and end-user services, but have neglected the availability of the middleware components themselves. As a result, a failure of a middleware component usually leads to a partial or even total blackout of the cloud. In this paper, we present the design and implementation of a novel scalable and highly available multi-master pattern for cloud middleware. Existing Infrastructure-as-a-Service cloud management frameworks are usually designed as a centralized tree topology with a three-tiered master-worker architecture. In contrast, we introduce a multi-tree concept in which all tree roots are connected in a fully connected mesh topology. In this architecture, user requests are load-balanced over multiple failover servers. Furthermore, our concept includes an automatic self-healing mechanism for the worker nodes of each tree.
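
The pattern described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Worker`, `Master`, `Cluster`, `dispatch`, and `heal_workers` are hypothetical, round-robin is an assumed load-balancing policy, and "healing" is modeled as simply marking a failed worker alive again.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    alive: bool = True


@dataclass
class Master:
    """Root of one tree; peers are the other roots in the mesh."""
    name: str
    workers: list
    alive: bool = True
    peers: list = field(default_factory=list, repr=False, compare=False)

    def heal_workers(self):
        """Restart every failed worker of this tree; return their names."""
        healed = [w.name for w in self.workers if not w.alive]
        for w in self.workers:
            w.alive = True  # assumption: restarting always succeeds
        return healed

    def handle(self, request):
        """Pass the request to the first live worker of this tree."""
        for w in self.workers:
            if w.alive:
                return f"{self.name}/{w.name}:{request}"
        raise RuntimeError(f"{self.name} has no live worker")


class Cluster:
    """Several masters in a full mesh, with requests balanced over them."""

    def __init__(self, masters):
        self.masters = masters
        for m in masters:
            # Fully connected mesh: every root knows every other root.
            m.peers = [p for p in masters if p is not m]
        self._next = 0

    def dispatch(self, request):
        """Round-robin over masters, skipping any that have failed."""
        for _ in range(len(self.masters)):
            master = self.masters[self._next]
            self._next = (self._next + 1) % len(self.masters)
            if master.alive:
                return master.handle(request)
        raise RuntimeError("no master available")
```

In this sketch, a failed master is skipped by `dispatch`, so requests continue to be served by the remaining roots, while `heal_workers` stands in for the self-healing step within a single tree.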
