Sliding Substitution of Failed Nodes

This paper considers three questions: how spare nodes should be allocated, how they should be substituted for faulty nodes, and how much such a substitution affects communication performance. The third question arises because node substitution modifies the rank mapping, which can incur additional message collisions. In a stencil computation, ranks can be mapped onto a Cartesian network in a straightforward way that incurs no message collisions. Once a substitution has occurred, however, this node-rank mapping may be destroyed. These questions must therefore be answered in a way that minimizes the degradation of communication performance. In this paper, several spare-node allocation and node-substitution methods are proposed, analyzed, and compared in terms of communication performance following the substitution. It is shown that when a failure occurs, peer-to-peer (P2P) communication performance on the K computer can be slowed by a factor of three and collective performance can be cut in half. On BG/Q, P2P performance can be slowed by a factor of five and collective performance by a factor of ten. These slowdowns can be reduced by using an appropriate substitution method.
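The effect of a substitution on the rank mapping can be illustrated with a small sketch. It is not the paper's method or measurement: it assumes a hypothetical 2D mesh with one spare column, counts Manhattan hops between the nodes that host stencil-neighbor ranks (ignoring routing around the failed node and ignoring collisions), and compares two illustrative strategies. The function names, and the reading of "sliding" as shifting a row of ranks toward the spare, are assumptions made for this example.

```python
def identity_mapping(rows, cols):
    """Map rank (r, c) to node (r, c); column index `cols` is left as spares."""
    return {(r, c): (r, c) for r in range(rows) for c in range(cols)}

def substitute_direct(mapping, failed, cols):
    """Move only the rank on the failed node to the spare in the same row."""
    new = dict(mapping)
    for rank, node in mapping.items():
        if node == failed:
            new[rank] = (failed[0], cols)
    return new

def substitute_slide(mapping, failed, cols):
    """Shift the failed rank and all ranks to its right one node toward the spare."""
    new = dict(mapping)
    for rank, (r, c) in mapping.items():
        if r == failed[0] and c >= failed[1]:
            new[rank] = (r, c + 1)
    return new

def max_neighbor_hops(mapping):
    """Largest Manhattan distance between nodes hosting stencil-neighbor ranks."""
    worst = 0
    for (r, c), (nr, nc) in mapping.items():
        for dr, dc in ((0, 1), (1, 0)):  # each neighbor pair is visited once
            peer = (r + dr, c + dc)
            if peer in mapping:
                pr, pc = mapping[peer]
                worst = max(worst, abs(nr - pr) + abs(nc - pc))
    return worst

rows, cols, failed = 4, 4, (1, 1)  # hypothetical mesh size and failed node
base = identity_mapping(rows, cols)
print(max_neighbor_hops(base))
print(max_neighbor_hops(substitute_direct(base, failed, cols)))
print(max_neighbor_hops(substitute_slide(base, failed, cols)))
```

In this toy setting the undisturbed mapping keeps every stencil neighbor one hop away, while moving a single rank to the spare stretches some neighbor links across most of the row. Shifting the row spreads that distortion over several ranks, so the longest link is shorter. Longer paths share more links, which is one way additional message collisions can appear.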
