LO|FA|MO: Fault Detection and Systemic Awareness for the QUonG Computing System

QUonG is a parallel computing platform developed at INFN, equipped with commodity multi-core CPUs coupled with latest-generation NVIDIA GPUs. Computing nodes communicate over a point-to-point, high-performance, low-latency 3D torus network implemented by APEnet+, an FPGA-based interconnect. Scaling this cluster towards peta- and possibly exascale is a prominent line of investigation, and in this context fault tolerance is a structural issue. Typical fault tolerance solutions for HPC systems (e.g. checkpoint/restart) must be triggered to be applied in an automated and transparent way, or at least knowledge of occurring faults must be propagated to prompt a readjustment: an effective tool to detect faults and make the system aware of them is required. Thus, as a first step towards a fault-tolerant QUonG, we designed the Local Fault Monitor (LO|FA|MO), an HW/SW solution aimed at providing systemic fault awareness. LO|FA|MO detects node faults through a mutual watchdog mechanism between the host and the APEnet+ NIC; moreover, diagnostic messages can be delivered to neighbouring nodes through both the 3D network and a secondary connection for service communication. This double path ensures that no fault remains unknown at the global level, guaranteeing systemic fault awareness with no single point of failure. In this paper we describe our LO|FA|MO implementation, reporting preliminary measurements that show its scalability and its next-to-nil impact on system performance.
