A hierarchical watchdog mechanism for systemic fault awareness on distributed systems

Systemic fault tolerance is usually pursued through strategies such as redundancy and checkpoint/restart, all of which must be triggered by safe and fast fault detection. We devised a hardware/software approach to fault detection that enables system-level Fault Awareness by implementing a hierarchical Mutual Watchdog. It relies on an improved high-performance Network Interface Card (NIC) that implements an n-dimensional mesh topology and a Service Network. The hierarchical watchdog mechanism quickly detects faults on each node: the Host and the high-performance NIC guard each other, while every node monitors its first neighbours in the mesh. Duplicated and distributed Supervisor Nodes receive diagnostic messages routed through either the Service Network or the n-dimensional network, then assemble a global picture of the system status. In this way, our approach achieves Fault Awareness with no single point of failure. We describe an implementation of this hardware/software co-design for our high-performance 3D torus NIC, focusing on how routed diagnostic messages do not affect system performance.

We approach fault tolerance for distributed systems from fault detection and awareness. We propose a HW/SW mechanism based on a mutual watchdog between Host and NIC. A double diagnostic message path leads to resilient systemic fault awareness. Our tool can interface with fault reaction/recovery systems to trigger them automatically. Our mechanism has no impact on system performance.
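The detection hierarchy described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the implementation discussed in the paper: the names `Node`, `neighbours`, `build_torus`, `tick` and `TIMEOUT` are hypothetical, time is modelled as discrete watchdog periods, and the choice of which path carries which report (the NIC reporting a Host fault over the mesh, the Host reporting a NIC fault over the Service Network) is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from itertools import product

TIMEOUT = 2  # hypothetical: missed periods tolerated before a fault is declared


@dataclass
class Node:
    coord: tuple
    host_alive: bool = True
    nic_alive: bool = True
    host_missed: int = 0  # periods since the NIC last saw a Host heartbeat
    nic_missed: int = 0   # periods since the Host last saw a NIC heartbeat
    neigh_missed: dict = field(default_factory=dict)  # per-neighbour counters


def neighbours(coord, dims):
    """Return the first neighbours of coord in a torus of the given dimensions."""
    result = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            if tuple(n) != coord:
                result.add(tuple(n))
    return result


def build_torus(dims):
    """Create one healthy node per coordinate of the torus."""
    return {c: Node(c) for c in product(*(range(d) for d in dims))}


def tick(nodes, dims, reports):
    """Advance one watchdog period; add (path, reporter, faulty, part) to reports."""
    for node in nodes.values():
        # Mutual watchdog: each side counts the periods its peer stayed silent.
        node.host_missed = 0 if node.host_alive else node.host_missed + 1
        node.nic_missed = 0 if node.nic_alive else node.nic_missed + 1
        if node.nic_alive and node.host_missed >= TIMEOUT:
            # Assumption: a live NIC reports its dead Host over the mesh.
            reports.add(("mesh", node.coord, node.coord, "host"))
        if node.host_alive and node.nic_missed >= TIMEOUT:
            # Assumption: a live Host reports its dead NIC over the Service Network.
            reports.add(("service", node.coord, node.coord, "nic"))
        if not node.nic_alive:
            continue  # a dead NIC cannot watch its links
        # Neighbour monitoring: a silent link marks the neighbouring NIC as faulty.
        for nb in neighbours(node.coord, dims):
            missed = 0 if nodes[nb].nic_alive else node.neigh_missed.get(nb, 0) + 1
            node.neigh_missed[nb] = missed
            if missed >= TIMEOUT:
                reports.add(("mesh", node.coord, nb, "nic"))


dims = (3, 3, 3)
nodes = build_torus(dims)
nodes[(0, 0, 0)].host_alive = False  # Host fault only: the local NIC notices
nodes[(1, 1, 1)].host_alive = False  # whole node down: only neighbours notice
nodes[(1, 1, 1)].nic_alive = False
reports = set()  # stands in for the picture assembled by a Supervisor Node
for _ in range(TIMEOUT):
    tick(nodes, dims, reports)
```

In this sketch a node whose Host and NIC both fail cannot report anything itself, so the Supervisor learns about it only through its first neighbours. That is the case the neighbour-monitoring layer of the hierarchy is meant to cover.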
