High availability for parallel computers

Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided by solutions such as RADIC, a fault-tolerant architecture with different protection levels that offers high availability with transparency, decentralization, flexibility and scalability for message-passing systems. Transient faults may cause an application running on a computer system to be removed from execution; however, their biggest risk is undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment.
