Fingerpointing correlated failures in replicated systems

Replicated systems are often hosted over underlying group communication protocols that provide totally ordered, reliable delivery of messages. In the face of a performance problem at a single node, these protocols can cause correlated performance degradations at even non-faulty nodes, leading to potential red herrings in failure diagnosis. We propose a fingerpointing approach that combines node-level (local) anomaly detection, followed by system-wide (global) fingerpointing. The local anomaly detection relies on threshold-based analyses of system metrics, while global fingerpointing is based on the hypothesis that the root-cause of the failure is the node with an "odd-man-out" view of the anomalies. We compare the results of applying three classifiers - a heuristic algorithm, an unsupervised learner (k-means clustering), and a supervised learner (k-nearest-neighbor) - to finger-point the faulty node.

[1]  Miguel Oom Temudo de Castro,et al.  Practical Byzantine fault tolerance , 1999, OSDI '99.

[2]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[3]  Armando Fox,et al.  Detecting application-level failures in component-based Internet services , 2005, IEEE Transactions on Neural Networks.

[4]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[5]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[6]  Frank Feather,et al.  A case study of Ethernet anomalies in a distributed computing environment , 1990 .

[7]  Group Communication : Helping or Obscuring Failure Diagnosis ? , 2006 .

[8]  Yair Amir,et al.  A low latency, loss tolerant architecture and protocol for wide area group communication , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[9]  Idit Keidar,et al.  Group communication specifications: a comprehensive study , 2001, CSUR.

[10]  Mike Hibler,et al.  An integrated experimental environment for distributed systems and networks , 2002, OSDI '02.

[11]  Isabelle Guyon,et al.  A Stability Based Method for Discovering Structure in Clustered Data , 2001, Pacific Symposium on Biocomputing.

[12]  Helen J. Wang,et al.  Automatic Misconfiguration Troubleshooting with PeerPressure , 2004, OSDI.

[13]  Amin Vahdat,et al.  Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.