A fault-tolerance protocol for non-preemptive active objects

ABSTRACT. Communication-induced checkpointing protocols appear to be the approach best suited to applications running on heterogeneous systems with a low failure rate. However, these protocols assume that a checkpoint can always be taken preemptively, before a message is processed. We therefore propose, within an active-object model, a communication-induced checkpointing fault-tolerance protocol adapted to non-preemptive processes. Unlike many existing protocols, this protocol ensures strong consistency of the recovery lines it forms, and it allows fully asynchronous recovery of the distributed system after a failure.
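To make the preemption assumption concrete, the following minimal sketch shows the classical index-based rule used by communication-induced checkpointing: each message carries the sender's checkpoint index, and a receiver that is behind takes a forced checkpoint before delivering the message. This is not the protocol proposed here; the names `Message`, `Process`, `checkpoint`, `send`, and `receive` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sender_index: int  # checkpoint index piggybacked by the sender


@dataclass
class Process:
    name: str
    index: int = 0  # index of the latest local checkpoint
    state: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)

    def checkpoint(self, index):
        # Save a copy of the current state, tagged with the checkpoint index.
        self.index = index
        self.checkpoints.append((index, list(self.state)))

    def send(self, payload):
        # Piggyback the local checkpoint index on every outgoing message.
        return Message(payload, self.index)

    def receive(self, msg):
        # Forced checkpoint: taken BEFORE the message is delivered. This step
        # assumes the process can be checkpointed at this exact instant,
        # which is the preemption assumption discussed in the abstract.
        if msg.sender_index > self.index:
            self.checkpoint(msg.sender_index)
        self.state.append(msg.payload)


p, q = Process("p"), Process("q")
p.checkpoint(1)              # p takes a local checkpoint; its index becomes 1
q.receive(p.send("hello"))   # q is forced to checkpoint before delivery
```

In this sketch the forced checkpoint happens inside `receive`, at an arbitrary point of the receiver's execution. A non-preemptive active object cannot be checkpointed at such a point, which is the difficulty the abstract identifies.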
