Toward Understanding Soft Faults in High Performance Cluster Networks

Fault management in high performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that hurt performance without causing outright failures often go unnoticed. In this paper, we classify such degradations as soft faults. We also identify consistent performance as an important service requirement in cluster networks and, using this requirement, describe a comprehensive strategy for cluster fault management.
