Fault tolerance for asynchronous active objects: model, protocol and experiments (Tolérance aux pannes pour objets actifs asynchrones : modèle, protocole et expérimentations)

