A Framework for High Availability Based on a Single System Image

High availability (HA) is now an important issue in cluster computing: as clusters grow ever larger, failures become more frequent. The literature offers many HA strategies for tolerating application failures, whether the applications are sequential or parallel. Unfortunately, these HA policies remain difficult to implement in a real system, so their study is most often purely theoretical, with no real implementation. A framework that eases the implementation of such policies is therefore of interest. A single system image (SSI), thanks to its mechanisms for the global management of cluster resources, is a good candidate to provide such a framework. This paper presents a preliminary study of such a framework on top of the Kerrighed SSI.
