A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing
Axel W. Krings | Thierry Gautier | Jean-Louis Roch | Samir Jafar
[1] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.
[2] Rémi Revire. Ordonnancement de graphe dynamique de tâches sur architecture de grande taille, 2004.
[3] Laxmikant V. Kalé, et al. A fault tolerant protocol for massively parallel systems, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[4] Matteo Frigo, et al. The implementation of the Cilk-5 multithreaded language, 1998, PLDI.
[5] André Schiper, et al. A Systematic Classification of Replicated Database Protocols based on Atomic Broadcast, 1999.
[6] Michael A. Bender, et al. Online Scheduling of Parallel Programs on Heterogeneous Systems with Applications to Cilk, 2002, SPAA '00.
[7] Miron Livny, et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, 1997.
[8] Laxmikant V. Kalé, et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[9] Georg Stellner, et al. CoCheck: checkpointing and process migration for MPI, 1996, Proceedings of International Conference on Parallel Processing.
[10] Gerson G. H. Cavalheiro, et al. Athapascan-1: On-line building data flow graph in a parallel language, 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).
[11] B. Bouteiller, et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging, 2003, ACM/IEEE SC 2003 Conference (SC'03).
[12] Theo Ungerer, et al. Asynchrony in Parallel Computing: From Dataflow to Multithreading, 2001, Scalable Comput. Pract. Exp.
[13] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.
[14] Sébastien Varrette, et al. Using data-flow analysis for resilience and result checking in peer-to-peer computations, 2004, Proceedings. 15th International Workshop on Database and Expert Systems Applications, 2004.
[15] Kishor S. Trivedi. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1984.
[16] Anh Nguyen-Tuong, et al. Exploiting data-flow for fault-tolerance in a wide-area parallel system, 1996, Proceedings 15th Symposium on Reliable Distributed Systems.
[17] Achour Mostéfaoui, et al. A communication-induced checkpointing protocol that ensures rollback-dependency trackability, 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.
[18] Ronald L. Graham, et al. Bounds on Multiprocessing Timing Anomalies, 1969, SIAM Journal on Applied Mathematics.
[19] Lorenzo Alvisi, et al. Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.