Probabilistic Diagnosis through Non-Intrusive Monitoring in Distributed Applications

When dependability outages occur in distributed critical infrastructures, detecting a failure is often not enough; the failure must also be diagnosed, i.e., its source must be identified. Diagnosis is challenging because errors can propagate quickly in high-throughput distributed applications. It often needs to be probabilistic because of imperfect observability of the payload system, the inability to do white-box testing, constraints on the amount of state that the diagnostic process can maintain, and imperfect tests used to verify the system. In this paper, we extend an existing Monitor architecture to perform probabilistic diagnosis of failures in large-scale network protocols. The Monitor remotely observes only the message exchanges between the protocol entities (PEs) and does not access internal protocol state. At runtime, it builds a causal and aggregate graph between the PEs based on their communication and uses this graph together with a rule base to diagnose the failure. For each suspected PE, the Monitor computes the probability that the error originated in that PE and propagated to the failure detection site. The framework is applied to a test-bed consisting of a reliable multicast protocol executing on the Purdue campus-wide network. Error injection experiments are performed to evaluate the accuracy and the performance overhead of the diagnostic process.
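
The idea of ranking suspected PEs from observed messages alone can be illustrated with a minimal Python sketch. It is not the Monitor's actual algorithm: the message format `(time, sender, receiver)`, the function names, the single uniform propagation probability `p_propagate`, and the scoring rule (best causal path, then normalization) are all assumptions made for illustration, and the rule base is omitted entirely.

```python
from collections import defaultdict


def build_causal_graph(messages):
    """Index observed messages by receiver.

    messages: iterable of (time, sender, receiver) tuples (assumed format).
    Returns a dict mapping receiver -> list of (time, sender).
    """
    incoming = defaultdict(list)
    for time, sender, receiver in messages:
        incoming[receiver].append((time, sender))
    return incoming


def diagnose(messages, failed_pe, detect_time, p_propagate=0.5):
    """Assign each suspected PE a normalized score for being the error source.

    Hypothetical scoring: a PE is a suspect if a chain of messages with
    strictly increasing timestamps leads from it to failed_pe before
    detect_time. Its raw score is p_propagate ** hops on its shortest
    such chain; scores are then normalized to sum to 1.
    """
    incoming = build_causal_graph(messages)
    best = {failed_pe: 1.0}  # the error may have originated at the failure site
    # Each entry: (pe, time before which the error must have left a sender, score)
    stack = [(failed_pe, detect_time, 1.0)]
    while stack:
        pe, deadline, score = stack.pop()
        for time, sender in incoming.get(pe, []):
            if time >= deadline:
                continue  # message sent too late to have carried the error
            new_score = score * p_propagate
            best[sender] = max(best.get(sender, 0.0), new_score)
            # Timestamps strictly decrease along this walk, so it terminates.
            stack.append((sender, time, new_score))
    total = sum(best.values())
    return {pe: s / total for pe, s in best.items()}


observed = [(1, "A", "B"), (2, "B", "C"), (4, "D", "C")]
ranking = diagnose(observed, failed_pe="C", detect_time=3)
for pe, prob in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(pe, round(prob, 3))
```

In this toy trace, D is never a suspect because its only message to C is observed after the failure was detected, while A remains a suspect through the chain A to B to C.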
