Anomaly detection for scientific workflow applications on networked clouds

Recent advances in cloud technologies and on-demand network circuits make it possible to run complex scientific workflow applications on dynamic, networked cloud infrastructure. Executing these workflows reliably on distributed clouds remains difficult, however, because performance anomalies and faults are frequent in such systems. Accurate, automatic, proactive, online anomaly detection is therefore needed to pinpoint the time and source of an observed anomaly and to guide adaptation of the application and the infrastructure. In this work, we present an anomaly detection algorithm that applies auto-regression (AR) based statistical methods to online monitoring time-series data to detect performance anomalies while scientific workflows and applications execute on networked cloud systems. We thoroughly evaluate the approach by injecting artificial, competing loads into the system. Results show that our AR-based detection algorithm can accurately detect performance anomalies for a variety of exemplar scientific workflows and applications.
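To make the general idea concrete, the following is a minimal sketch of AR-based detection on a monitoring time series. It is not the algorithm evaluated in this work: the model order, the least-squares fit, the `k`-sigma residual threshold, the function names `fit_ar` and `detect_anomalies`, and the synthetic metric are all assumptions chosen for illustration.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + ap*x[t-p] (p = order)."""
    x = np.asarray(series, dtype=float)
    # Each row holds [1, x[t-1], ..., x[t-order]] for one time step t.
    rows = [np.concatenate(([1.0], x[t - order:t][::-1]))
            for t in range(order, len(x))]
    design = np.array(rows)
    target = x[order:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual_std = np.std(target - design @ coef)
    return coef, residual_std

def detect_anomalies(train, stream, order=3, k=4.0):
    """Flag stream indices whose one-step prediction error exceeds k * sigma.

    Assumption: the AR model is fitted once on anomaly-free training data,
    and observed values (even flagged ones) are fed back into the history.
    """
    coef, sigma = fit_ar(train, order)
    history = [float(v) for v in np.asarray(train, dtype=float)[-order:]]
    flagged = []
    for i, value in enumerate(stream):
        lagged = history[::-1]  # most recent observation first
        predicted = coef[0] + np.dot(coef[1:], lagged)
        if abs(value - predicted) > k * sigma:
            flagged.append(i)
        history = history[1:] + [float(value)]
    return flagged

# Synthetic, hypothetical metric: periodic utilisation with small noise.
rng = np.random.default_rng(0)
t = np.arange(400)
metric = 50 + 5 * np.sin(2 * np.pi * t / 25) + rng.normal(0, 0.5, t.size)
train, stream = metric[:300], metric[300:].copy()
stream[60:] += 20  # injected competing load starting at stream index 60
flagged = detect_anomalies(train, stream)
print(flagged)
```

Because observed values are fed back into the history, this sketch mainly flags the onset of a sustained load, not its full duration. A detector that must track the whole anomalous interval would need a different update rule.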
