Automated Rule-Based Diagnosis Through a Distributed Monitor System

Distributed systems now form much of our critical infrastructure, and dependability outages are becoming increasingly common. In many situations, it is necessary not only to detect a failure but also to diagnose it, that is, to identify its source. Diagnosis is challenging because high-throughput applications with frequent interactions between components allow errors to propagate quickly. It is also desirable to treat applications as black boxes during diagnosis. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor observes only the message exchanges between the protocol entities (PEs), remotely, and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this graph, together with a rule base of allowed state-transition paths, to diagnose the failure. The tests used for diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis that handles failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
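As a rough illustration of the idea described above, the following minimal sketch shows a monitor that sees only messages, records which PE sent to which, checks each message against a small rule base, and on a detected failure walks the causal graph backwards to list suspect PEs. It is not the paper's implementation: the `Monitor` class, the `ALLOWED` rule table, the state names, and the message types are all hypothetical, the causal graph ignores message ordering, and the tests are treated as exact rather than as having imperfect coverage.

```python
from collections import defaultdict

# Hypothetical rule base: (inferred sender state, message type) -> next state.
# Here a PE must send HELLO before it may send DATA.
ALLOWED = {
    ("init", "HELLO"): "ready",
    ("ready", "DATA"): "ready",
}


class Monitor:
    """Black-box monitor: it only sees (sender, receiver, message type)."""

    def __init__(self):
        self.state = defaultdict(lambda: "init")  # inferred state per PE
        self.preds = defaultdict(set)             # causal graph: receiver -> senders
        self.violations = defaultdict(int)        # rule violations per PE

    def observe(self, sender, receiver, msg_type):
        # Record the causal edge and test the message against the rule base.
        self.preds[receiver].add(sender)
        key = (self.state[sender], msg_type)
        if key in ALLOWED:
            self.state[sender] = ALLOWED[key]
        else:
            self.violations[sender] += 1

    def diagnose(self, failed_pe):
        # Collect failed_pe and every PE that can reach it in the causal
        # graph, then keep those that violated at least one rule.
        seen, stack = set(), [failed_pe]
        while stack:
            pe = stack.pop()
            if pe in seen:
                continue
            seen.add(pe)
            stack.extend(self.preds[pe])
        return sorted(pe for pe in seen if self.violations[pe] > 0)


monitor = Monitor()
monitor.observe("A", "B", "HELLO")
monitor.observe("A", "B", "DATA")
monitor.observe("C", "B", "DATA")   # C skipped HELLO: rule violation
monitor.observe("B", "D", "HELLO")
print(monitor.diagnose("D"))        # prints ['C']
```

In this toy run the failure is detected at D, but the only PE in D's causal past that broke a rule is C, so C is reported as the suspect.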
