Experimental analysis of the first order time difference of indicators used in the monitoring of complex systems

Complex and real-time systems often operate under variable and non-stationary conditions, and therefore require efficient and extensive monitoring and error detection solutions. Among these, we focus on anomaly detection techniques, which measure the evolution of the monitored indicators over time to identify anomalies, i.e., deviations from the expected operational behavior. In this paper, we investigate the possibility of modeling the evolution of indicators over time using the random walk model. In particular, we focus on the detection of system anomalies at the application level (software errors), based on the monitoring of indicators at the operating system level. The approach is based on the experimental evaluation of a large set of heterogeneous indicators, acquired under different operating conditions, in terms of both workload and fault load, on an air traffic management target system. The results of the analysis show that, in a large number of cases, the histogram of the first-order time differences closely approximates a Gaussian distribution, independently of the nature of the indicator and its statistical distribution. These outcomes suggest that adopting a Gaussian random walk model for several monitoring indicators has experimental support and deserves to be further investigated on a wider scale, in order to determine its range of applicability and representativeness.
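As an illustration of the quantity studied here, the following minimal sketch computes the first-order time differences of a series and checks how closely they follow the Gaussian one-sigma and two-sigma rules. It is not the authors' tooling: the series is a synthetic Gaussian random walk rather than a measured indicator, and the function names (`first_differences`, `gaussian_random_walk`, `fraction_within`), the step count, and the step standard deviation are all assumptions made for the example.

```python
import random
import statistics


def first_differences(series):
    """Return d_t = x_t - x_(t-1) for consecutive samples of the series."""
    return [b - a for a, b in zip(series, series[1:])]


def gaussian_random_walk(n_steps, sigma=1.0, start=0.0, seed=0):
    """Synthetic stand-in for an indicator: x_t = x_(t-1) + e_t, e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    walk = [start]
    for _ in range(n_steps):
        walk.append(walk[-1] + rng.gauss(0.0, sigma))
    return walk


def fraction_within(diffs, k=1.0):
    """Share of differences lying within k sample standard deviations of their mean."""
    mu = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return sum(abs(d - mu) <= k * sd for d in diffs) / len(diffs)


# Hypothetical parameters: 5000 steps with step standard deviation 2.0.
walk = gaussian_random_walk(5000, sigma=2.0)
diffs = first_differences(walk)

# For Gaussian differences these shares should be close to 0.683 and 0.954.
share_1sigma = fraction_within(diffs, 1.0)
share_2sigma = fraction_within(diffs, 2.0)
print(f"within 1 sigma: {share_1sigma:.3f}, within 2 sigma: {share_2sigma:.3f}")
```

On a real indicator, `walk` would be replaced by the sampled values of that indicator. A share far from the Gaussian reference values would suggest that the Gaussian random walk assumption does not hold for that indicator under the observed conditions.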
