Metric Ranking of Invariant Networks with Belief Propagation

The management of large-scale distributed information systems relies on the effective use and modeling of monitoring data collected at various points in these systems. A promising approach is to discover invariant relationships among the monitoring data and to build invariant networks, in which each node is a monitoring data source (metric) and each link indicates an invariant relationship between two metrics. Such an invariant network representation can help system experts localize and diagnose system faults by examining the broken invariant relationships and their related metrics, because system faults usually propagate among the monitoring data and eventually break some invariant relationships. However, at any given time an invariant network usually contains many broken links (invariant relationships), and without proper guidance it is difficult for system experts to inspect them all manually. Thus, a critical challenge is how to effectively and efficiently rank the metrics (nodes) of an invariant network according to their anomaly levels. The ranked list of metrics gives system experts useful guidance for localizing and diagnosing system faults. To this end, we propose to model the nodes and the broken links as a Markov Random Field (MRF), and develop an iterative algorithm based on belief propagation (BP) to infer the anomaly level of each node. Finally, we validate the proposed algorithm on both real-world and synthetic data sets to illustrate its effectiveness.
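The abstract leaves the MRF potentials and the message schedule unspecified. As a rough illustration only, the Python sketch below treats each metric as a binary variable (normal or abnormal), attaches one pairwise potential to every broken link, and runs loopy sum-product belief propagation to obtain a per-metric anomaly probability that serves as the ranking score. The potential table `PSI`, the anomaly prior `prior_abnormal`, the convergence settings, and the function name `rank_metrics` are hypothetical choices made for this example, not parameters of the proposed algorithm.

```python
from collections import defaultdict

# Hypothetical pairwise potential for a broken link, indexed PSI[x_i][x_j]
# with state 0 = normal and 1 = abnormal: a broken invariant is unlikely
# when both of its metrics are normal. The table is symmetric.
PSI = ((0.1, 0.9), (0.9, 0.9))


def normalize(v):
    s = v[0] + v[1]
    return (v[0] / s, v[1] / s)


def rank_metrics(metrics, broken_links, prior_abnormal=0.1, iters=50, tol=1e-9):
    """Rank metrics by P(abnormal) using loopy sum-product belief propagation.

    metrics: all metric names; broken_links: (u, v) pairs of broken invariants.
    Assumes a simple graph (no duplicate links, no self-loops).
    """
    # Hypothetical node prior: faults are assumed to be rare.
    phi = {m: (1.0 - prior_abnormal, prior_abnormal) for m in metrics}
    nbrs = defaultdict(list)
    for u, v in broken_links:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # msgs[(i, j)]: message from i to j over j's two states, initially uniform.
    msgs = {(i, j): (0.5, 0.5) for i in metrics for j in nbrs[i]}

    def incoming(i, skip=None):
        # Prior of i times all current messages into i, except the one from `skip`.
        h = list(phi[i])
        for k in nbrs[i]:
            if k != skip:
                h[0] *= msgs[(k, i)][0]
                h[1] *= msgs[(k, i)][1]
        return h

    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            h = incoming(i, skip=j)
            # Sum over x_i of h(x_i) * PSI[x_i][x_j], for each state x_j.
            out = tuple(h[0] * PSI[0][xj] + h[1] * PSI[1][xj] for xj in (0, 1))
            new[(i, j)] = normalize(out)
        delta = max((abs(new[e][1] - msgs[e][1]) for e in msgs), default=0.0)
        msgs = new  # synchronous update: all messages replaced at once
        if delta < tol:
            break

    # Belief of each metric: prior times all incoming messages, normalized.
    belief = {m: normalize(incoming(m))[1] for m in metrics}
    ranking = sorted(metrics, key=lambda m: belief[m], reverse=True)
    return ranking, belief


if __name__ == "__main__":
    # Toy example: all three broken invariants touch metric A;
    # metric E has no broken invariant at all.
    ranking, belief = rank_metrics(["A", "B", "C", "D", "E"],
                                   [("A", "B"), ("A", "C"), ("A", "D")])
    print(ranking[0], ranking[-1])  # prints "A E"
```

Messages are updated synchronously and the loop stops once they change by less than `tol`. On a network with cycles, loopy BP is not guaranteed to converge, so the sketch also caps the number of iterations; on a tree, as in the toy example, it converges to the exact marginals.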
