Supporting System-wide Similarity Queries for Networked System Management

Today's networked systems are extensively instrumented and collect a wealth of monitoring data. In this paper, we propose a framework called System-wide Similarity Query (S2Q) that supports a new type of similarity query on monitoring data for managing complex networked systems. The similarity queries are defined on a novel data model that captures system states. The implementation includes a streaming algorithm for online state-modeling computation and a companion graph-based indexing technique for fast retrieval of historical system states. S2Q simplifies many systems management tasks by giving operators a simple and intuitive query interface. We evaluate two applications: (i) fast diagnosis of repeated failures in enterprise IT systems, and (ii) automated application traffic profiling on computer networks. For the first application, diagnosis accuracy reaches up to 95% on a multi-tier web service testbed. For the second, major network applications were automatically identified in traffic logs from a large campus wireless network.
