NIRVANA: A Non-intrusive Black-Box Monitoring Framework for Rack-Level Fault Detection

Many organizations still manage mid-sized or large in-house data centers that require very expensive maintenance efforts, including fault detection. Common monitoring frameworks used to detect faults quickly are complex to deploy and maintain, expensive, and intrusive, as they require the installation of probes on the monitored hardware and software to collect raw data. Such intrusiveness can be problematic because it imposes installation and management overhead and may interfere with security and privacy policies. In this paper we introduce NIRVANA, a novel monitoring system for fault detection that works at rack level and is (i) non-intrusive, i.e., it does not require the installation of software probes on the hosts to be monitored, and (ii) black-box, i.e., agnostic with respect to the monitored applications. At the core of our solution lies the observation that aggregated features that can be monitored at rack level in a non-intrusive and black-box way show predictable behaviors while the system works in both fault-free and faulty states. It is therefore possible to detect and identify faults by monitoring and analyzing any perturbations to these behaviors. An extensive experimental evaluation shows that non-intrusiveness does not significantly hamper the fault detection capabilities of the monitoring system, thus validating our approach.
