Paper information - Fingerpointing correlated failures in replicated systems

Replicated systems are often hosted over underlying group communication protocols that provide totally ordered, reliable delivery of messages. In the face of a performance problem at a single node, these protocols can cause correlated performance degradations even at non-faulty nodes, leading to potential red herrings in failure diagnosis. We propose a fingerpointing approach that combines node-level (local) anomaly detection with subsequent system-wide (global) fingerpointing. The local anomaly detection relies on threshold-based analyses of system metrics, while global fingerpointing is based on the hypothesis that the root cause of the failure is the node with an "odd-man-out" view of the anomalies. We compare the results of applying three classifiers - a heuristic algorithm, an unsupervised learner (k-means clustering), and a supervised learner (k-nearest-neighbor) - to fingerpoint the faulty node.
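The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the paper's heuristic or either of its learners: the function names, the metric names, the 3-sigma threshold, and the use of Hamming distance to score the "odd-man-out" are all assumptions made here for illustration. Each node first turns its metrics into a vector of 0/1 anomaly flags by comparing observed values against a baseline; the node whose vector differs most from the others is then reported as the suspect.

```python
from statistics import mean, stdev


def local_anomalies(baseline, observed, k=3.0):
    """Local stage (assumed rule): flag a metric when the observed value lies
    more than k baseline standard deviations from the baseline mean."""
    flags = []
    for name in sorted(baseline):  # fixed metric order so vectors are comparable
        mu = mean(baseline[name])
        sigma = stdev(baseline[name])
        flags.append(1 if abs(observed[name] - mu) > k * sigma else 0)
    return flags


def hamming(a, b):
    """Number of positions at which two equal-length flag vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def odd_man_out(views):
    """Global stage (assumed rule): return the node whose anomaly vector has
    the largest total Hamming distance to all other nodes' vectors.
    Ties are broken by dictionary order, so at least three nodes are needed
    for a meaningful answer."""
    scores = {
        node: sum(hamming(vec, other) for peer, other in views.items() if peer != node)
        for node, vec in views.items()
    }
    return max(scores, key=scores.get)


# Hypothetical data: node A has a CPU problem, and the group communication
# protocol spreads the resulting latency increase to B and C as well.
baseline = {"cpu": [10, 12, 11, 9, 10], "net_latency": [5, 6, 5, 6, 5]}
observed = {
    "A": {"cpu": 95, "net_latency": 40},
    "B": {"cpu": 11, "net_latency": 38},
    "C": {"cpu": 10, "net_latency": 41},
}

views = {node: local_anomalies(baseline, obs) for node, obs in observed.items()}
culprit = odd_man_out(views)
```

In this made-up scenario all three nodes flag the latency metric, which is the correlated degradation the abstract warns about, but only A also flags CPU, so A is singled out. A real deployment would need per-node baselines and a principled choice of threshold, neither of which is specified here.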
