Statistical Monitoring + Predictable Recovery = Self-*

It is by now motherhood-and-apple-pie that complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications present the opportunity for a new approach to keeping them running in the face of both "natural" failures and adversarial attacks. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system: what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory (SLT) techniques to problems of system dependability.
