Transparent Fault Tolerance Solution at Socket Level Based on RADIC
We present a transparent fault tolerance middleware based on RADIC (Redundant Array of Distributed Independent Controllers), a transparent and scalable fault-tolerant architecture for parallel applications. The middleware is designed at socket level and establishes a secure tunnel connection that preserves the TCP sessions opened by the application despite node failures. It runs at user level and is independent of the message-passing communication library in use. Protection is provided through uncoordinated checkpoints and message logging, and recovery is automatic, so node failures require no intervention from the administrator. We have tested our fault tolerance system by executing master-worker (M/W) and SPMD applications that follow different communication patterns.
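As an illustration of how uncoordinated checkpoints and message logging can combine to recover a failed node, the following is a minimal in-memory sketch. It is not the RADIC implementation: the abstract does not say where logs are stored or how the tunnel replays traffic, so the sender-side log, the sequence numbers, and the names `LoggedChannel`, `Receiver`, and `demo` are all assumptions made for this example, and no real sockets are involved.

```python
from dataclasses import dataclass, field


@dataclass
class LoggedChannel:
    """Hypothetical sender side: logs each message until a checkpoint covers it."""
    next_seq: int = 0
    log: dict = field(default_factory=dict)  # sequence number -> payload

    def send(self, payload, deliver):
        # Log first, then hand the message to the receiver.
        seq = self.next_seq
        self.log[seq] = payload
        self.next_seq += 1
        deliver(seq, payload)
        return seq

    def replay(self, after_seq, deliver):
        # Resend, in order, every logged message newer than the checkpoint.
        for seq in sorted(self.log):
            if seq > after_seq:
                deliver(seq, self.log[seq])

    def truncate(self, upto_seq):
        # Messages covered by a checkpoint no longer need to be kept.
        for seq in [s for s in self.log if s <= upto_seq]:
            del self.log[seq]


@dataclass
class Receiver:
    """Hypothetical process whose state is the list of payloads received."""
    last_seq: int = -1
    state: list = field(default_factory=list)

    def deliver(self, seq, payload):
        if seq <= self.last_seq:
            return  # duplicate from a replay, ignore it
        self.state.append(payload)
        self.last_seq = seq

    def checkpoint(self):
        # Taken independently of other processes (uncoordinated).
        return (self.last_seq, list(self.state))

    @classmethod
    def restore(cls, snapshot):
        last_seq, state = snapshot
        return cls(last_seq=last_seq, state=list(state))


def demo():
    channel = LoggedChannel()
    worker = Receiver()
    channel.send("a", worker.deliver)
    channel.send("b", worker.deliver)
    snapshot = worker.checkpoint()
    channel.send("c", worker.deliver)
    # Simulated node failure: everything after the checkpoint is lost.
    recovered = Receiver.restore(snapshot)
    channel.replay(snapshot[0], recovered.deliver)
    return recovered.state
```

In this sketch, the recovered process reaches the same state as the failed one because the log holds exactly the messages delivered after its last checkpoint. No other process has to roll back, which is the property that lets checkpoints stay uncoordinated.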