Adaptive Anomaly Detection in Performance Metric Streams

Continuous detection of performance anomalies, such as service degradations, has become critical in cloud and Internet services because of their impact on quality of service and end-user experience. However, the volume and rapidly changing behavior of metric streams make this a challenging task. Many diagnosis frameworks rely on thresholding under stationarity or normality assumptions, or on complex models that require extensive offline training. Such techniques are known to be prone to spurious false alarms in online settings, where metric streams undergo rapid contextual changes away from known baselines. Hence, we propose two unsupervised incremental techniques that follow a two-step strategy: first, we estimate an underlying temporal property of the stream via adaptive learning; then, we apply statistically robust control charts to recognize deviations. We evaluated our techniques by replaying over 40 time-series streams from the Yahoo! Webscope S5 datasets, as well as four other traces of real Web service QoS and ISP traffic measurements. Our methods achieve high detection accuracy with few false alarms, and they generally perform better than an open-source package for time-series anomaly detection.
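The two-step strategy (adaptive estimation, then a robust control chart) can be illustrated with a minimal sketch. The abstract does not name the estimator or the chart statistic, so everything below is an assumption for illustration only: the "temporal property" is taken to be the stream's level, tracked with an exponentially weighted moving average (EWMA), and the robust chart flags residuals outside median ± k·1.4826·MAD over a sliding window. The class `RobustStreamDetector` and its parameters `alpha`, `window`, `k`, and `warmup` are hypothetical names, not the authors' techniques.

```python
import math
import statistics
from collections import deque


class RobustStreamDetector:
    """Hypothetical two-step online detector (illustrative, not the paper's method).

    Step 1: adaptively track the stream's level with an EWMA (smoothing `alpha`).
    Step 2: a robust control chart over a sliding window of residuals; a point is
    flagged when |residual - median| > k * 1.4826 * MAD.
    """

    def __init__(self, alpha=0.3, window=30, k=3.0, warmup=10):
        self.alpha = alpha          # EWMA smoothing factor for the baseline
        self.k = k                  # control-limit width in robust sigmas
        self.warmup = warmup        # residuals needed before charting starts
        self.level = None           # current adaptive baseline estimate
        self.residuals = deque(maxlen=window)  # sliding window of residuals

    def update(self, x):
        """Consume one observation and return True if it is flagged."""
        if self.level is None:
            self.level = float(x)   # first point only initializes the baseline
            return False
        residual = x - self.level
        flagged = False
        if len(self.residuals) >= self.warmup:
            med = statistics.median(self.residuals)
            mad = statistics.median([abs(r - med) for r in self.residuals])
            limit = self.k * 1.4826 * mad  # 1.4826*MAD ~ std. dev. under normality
            if limit > 0 and abs(residual - med) > limit:
                flagged = True
                # Winsorize: clip the residual to the control limit so one spike
                # only nudges the baseline, while a persistent shift is still
                # absorbed over several steps.
                residual = med + math.copysign(limit, residual - med)
        self.level += self.alpha * residual   # incremental (online) EWMA update
        self.residuals.append(residual)
        return flagged
```

Feeding a stream point by point, e.g. `flags = [det.update(x) for x in stream]`, yields one decision per observation without offline training. Median and MAD are used instead of mean and standard deviation because they are not inflated by the very outliers being detected, and clipping flagged residuals before the EWMA update is one simple way (again, an assumption here) to keep the baseline adaptive to level shifts without letting isolated spikes drag it.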
