Grid-Oriented Process Clustering System for Partial Message Logging

In a computer cluster composed of many nodes, the mean time between failures shortens as the number of nodes increases. Long-running tasks may therefore never complete, because a failure interrupts them first. For this reason, fault tolerance has become an essential part of high-performance computing. Partial message logging groups processes into clusters, coordinates checkpoints within each cluster, and logs only the messages exchanged between clusters. Our study proposes a system with two features that improve the efficiency of partial message logging: 1) the communication log used for clustering is recorded at runtime, and 2) a graph partitioning algorithm reduces the complexity of the system by geometrically partitioning a grid graph. The proposed system is evaluated by executing a scientific application. The resulting process clusters are compared with those of existing methods in terms of clustering performance and quality.
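The idea of geometrically partitioning a grid of processes and then measuring how much traffic crosses cluster boundaries can be illustrated with a minimal sketch. This is not the system described above: the function names (`block_partition`, `stencil_log`, `logged_fraction`), the rectangular block scheme, and the 4-neighbour halo-exchange log are all assumptions made for illustration. A real communication log would be captured at runtime rather than synthesized.

```python
from collections import defaultdict


def block_partition(rows, cols, block_rows, block_cols):
    """Map each rank on a rows x cols process grid to a rectangular cluster id.

    Hypothetical geometric partitioner: ranks are numbered row-major and
    grouped into blocks of block_rows x block_cols processes.
    """
    blocks_per_row = -(-cols // block_cols)  # ceiling division
    cluster_of = {}
    for r in range(rows):
        for c in range(cols):
            rank = r * cols + c
            cluster_of[rank] = (r // block_rows) * blocks_per_row + (c // block_cols)
    return cluster_of


def stencil_log(rows, cols, nbytes=1):
    """Synthesize a communication log for a 4-neighbour halo exchange.

    Stand-in for a log recorded at runtime (assumption): maps each
    (source, destination) pair to the number of bytes sent.
    """
    log = defaultdict(int)
    for r in range(rows):
        for c in range(cols):
            src = r * cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    log[(src, nr * cols + nc)] += nbytes
    return log


def logged_fraction(comm_log, cluster_of):
    """Return the fraction of bytes that cross cluster boundaries.

    Under partial message logging, only this inter-cluster traffic
    needs to be logged.
    """
    total = sum(comm_log.values())
    crossing = sum(
        volume
        for (src, dst), volume in comm_log.items()
        if cluster_of[src] != cluster_of[dst]
    )
    return crossing / total if total else 0.0


if __name__ == "__main__":
    clusters = block_partition(4, 4, 2, 2)
    log = stencil_log(4, 4)
    print(f"inter-cluster fraction: {logged_fraction(log, clusters):.3f}")
```

For a 4 x 4 grid split into 2 x 2 blocks, one third of the halo traffic crosses a cluster boundary in this toy model. Larger blocks lower that fraction but enlarge the set of processes that must roll back together, which is the trade-off a clustering method has to balance.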
