Software-Based Failure Detection and Recovery in Programmable Network Interfaces

The complex network interfaces used by emerging network technologies have renewed concerns about network reliability. In this paper, we present an effective, low-overhead fault tolerance technique for recovering from network interface failures. Failure detection combines a software watchdog timer, which detects network processor hangs, with a self-testing scheme, which detects interface failures other than processor hangs. The self-testing scheme periodically directs the control flow through only the active software modules in order to detect errors that affect instructions in the local memory of the network interface. Failure recovery restores the state of the network interface from a small backup copy that contains only the information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact on the host system and to be completely transparent to the user.
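The three mechanisms named in the abstract (watchdog timer, self-test through active modules, and recovery from a small backup) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's implementation. The names `SimulatedNIC`, `Monitor`, `checksum_module`, and `route_module`, the heartbeat counter, the probe value, and the tick limit are all hypothetical. Python callables stand in for instructions held in the interface's local memory.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def checksum_module(x: int) -> int:
    # Hypothetical software module running on the interface.
    return (x * 31 + 7) & 0xFFFF


def route_module(x: int) -> int:
    # Second hypothetical module.
    return x ^ 0x5A5A


@dataclass
class SimulatedNIC:
    """Toy stand-in for a programmable network interface (assumption)."""
    modules: Dict[str, Callable[[int], int]]   # "instructions" in local memory
    active: List[str]                          # modules currently in use
    state: Dict[str, int] = field(default_factory=dict)
    heartbeat: int = 0
    hung: bool = False

    def step(self) -> None:
        # A healthy network processor advances the heartbeat; a hung one does not.
        if not self.hung:
            self.heartbeat += 1

    def run_probe(self, probe: int) -> int:
        # Direct the control flow through only the active modules.
        value = probe
        for name in self.active:
            value = self.modules[name](value)
        return value


class Monitor:
    """Detects hangs and corrupted modules, then restores from a backup."""

    def __init__(self, nic: SimulatedNIC, probe: int = 0x1234, limit: int = 3):
        self.nic = nic
        self.probe = probe
        self.limit = limit                            # ticks tolerated without progress
        self.golden_modules = dict(nic.modules)       # known-good code image
        self.expected = nic.run_probe(probe)          # known-good probe result
        self.backup_state = copy.deepcopy(nic.state)  # small state backup
        self.last_heartbeat = nic.heartbeat
        self.stalled_ticks = 0

    def checkpoint(self) -> None:
        # Refresh the backup with the current recoverable state.
        self.backup_state = copy.deepcopy(self.nic.state)

    def watchdog_expired(self) -> bool:
        # Software watchdog: count consecutive ticks with no heartbeat progress.
        if self.nic.heartbeat == self.last_heartbeat:
            self.stalled_ticks += 1
        else:
            self.stalled_ticks = 0
            self.last_heartbeat = self.nic.heartbeat
        return self.stalled_ticks >= self.limit

    def self_test_failed(self) -> bool:
        # Self-test: a wrong probe result means an active module is corrupted.
        return self.nic.run_probe(self.probe) != self.expected

    def recover(self) -> None:
        # Reload the code image and restore state from the backup copy.
        self.nic.modules = dict(self.golden_modules)
        self.nic.state = copy.deepcopy(self.backup_state)
        self.nic.hung = False
        self.stalled_ticks = 0
        self.last_heartbeat = self.nic.heartbeat
```

In this sketch a monitor would call `watchdog_expired()` and `self_test_failed()` periodically and call `recover()` when either returns true. Probing only the modules listed in `active` mirrors the abstract's point that the self-test follows the control flow through active modules rather than checking all of local memory.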
