Proactive Fault Monitoring in Enterprise Servers

Abstract

New proactive fault monitoring innovations are being developed, demonstrated on executing servers, and productized to enhance the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) has been developed that collects time series signals relating to the health of dynamically executing servers. These time series provide quantitative metrics associated with physical variables (distributed temperatures, voltages, and currents throughout the system), "soft" performance variables (loads, throughputs, queue lengths, bit error rates, etc.), and various quality-of-service (QoS) metrics. The CSTH signals are continuously archived to an offline circular file (i.e., the "Black Box Flight Recorder") that is helping to identify and eliminate costly sources of No-Trouble-Found (NTF) events in Sun systems, and the signals are concurrently processed in real time using advanced pattern recognition for proactive anomaly detection.

Examples are presented of the uses of the CSTH coupled with pattern recognition for high-sensitivity predictive failure analysis that is helping to increase component and system availability while decreasing the incidence of NTF events, which have become a costly serviceability/warranty issue in the enterprise computing industry.
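As a rough illustration of the two data paths described above (a circular archive plus a concurrent real-time check), the following Python sketch may help. It is not the CSTH implementation: the `FlightRecorder` class, the `is_anomalous` helper, the `cpu_temp` signal name, and the temperature readings are all hypothetical, and a plain standard-deviation threshold stands in for the advanced pattern recognition the abstract refers to.

```python
from collections import deque
from statistics import mean, stdev


class FlightRecorder:
    """Hypothetical fixed-capacity archive: once full, each new sample
    overwrites the oldest one, like a circular file."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def record(self, timestamp, signal, value):
        self.samples.append((timestamp, signal, value))

    def history(self, signal):
        """Return the retained (timestamp, value) pairs for one signal."""
        return [(t, v) for t, s, v in self.samples if s == signal]


def is_anomalous(baseline, value, threshold=4.0):
    """Flag a value lying more than `threshold` standard deviations from
    the baseline mean. A deliberately simple stand-in (an assumption),
    not the pattern recognition method used with the CSTH."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Made-up CPU temperature readings in degrees Celsius.
readings = [41.0, 41.2, 40.9, 41.1, 41.0, 40.8, 41.3, 55.0]

recorder = FlightRecorder(capacity=5)
alarms = []
for t, temp in enumerate(readings):
    past = [v for _, v in recorder.history("cpu_temp")]
    # Check the new sample against what is retained, then archive it.
    if len(past) >= 3 and is_anomalous(past, temp):
        alarms.append(t)
    recorder.record(t, "cpu_temp", temp)

print(alarms)
```

With these made-up readings only the final sample (index 7) is flagged, and the recorder retains just the five most recent samples. A real deployment would persist the archive to disk and monitor many correlated signals at once.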
