Paper Information - Fault Tolerant Master-Worker over a Multi-Cluster Architecture

Fault Tolerant Master-Worker over a Multi-Cluster Architecture

The growth of clusters into cluster collections increases the number of potential points of failure, requiring the implementation of a fault-tolerance scheme. The CoHNOW is organized as a hierarchical master-worker scheme, and its clusters may be geographically distributed and interconnected over the Internet. This paper describes a system of Fault-Tolerant protection by Data Replication (FT-DR), based on preserving critical functions through on-line dynamic data replication. The goal of the system model is to detect failures in any of the system's functional elements and to tolerate them by recovering system consistency, guaranteeing completion of the work in progress (the recovery procedure). The model is designed to tolerate more than one simultaneous failure. The model's fault-tolerance activities fall into three distinct phases: startup, normal execution (including failure-detection monitoring), and failure recovery. The system is intended for general master-worker applications running on the CoHNOW and is transparent to both the user and the application. The requirements the master-worker environment must meet to support these capabilities, and the resulting runtime overhead, are under evaluation.

Emilio Luque | Dolores Rexachs | Eduardo Argollo | Angelo Duarte | J. Rodrigues de Souza

[1] Emilio Luque, et al. Architectures for an Efficient Application Execution in a Collection of HNOWS, 2002, PVM/MPI.

[2] Emilio Luque, et al. Efficient Execution on Long-Distance Geographically Distributed Dedicated Clusters, 2004, PVM/MPI.

[3] Jon B. Weissman. Fault Tolerant Wide-Area Parallel Computing, 2000, IPDPS Workshops.

[4] Yves Robert, et al. Matrix Multiplication on Heterogeneous Platforms, 2001, IEEE Trans. Parallel Distributed Syst.

[5] Thomas Hérault, et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, 2002, ACM/IEEE SC 2002 Conference (SC'02).