Dependency analysis of cloud applications for performance monitoring using recurrent neural networks

Performance monitoring of cloud-native applications that consist of several micro-services involves the analysis of time series data collected from the infrastructure, platform, and application layers of the cloud software stack. Analyzing the runtime dependencies among the component micro-services is an essential step towards managing cloud resources, detecting anomalous application behavior, and meeting customer Service Level Agreements (SLAs). Finding such dependencies is challenging because of the non-linear nature of the interactions, aberrant data measurements, and a lack of domain knowledge. In this paper, we propose a novel use of the modeling capability of Long Short-Term Memory (LSTM) recurrent neural networks, which excel at capturing temporal relationships in multivariate time series data and are resilient to noisy pattern representations. Our proposed technique inspects the structure of the trained LSTM model to uncover the dependencies among performance metrics that were learned during training. We apply this technique in three monitoring use cases: finding the strongest performance predictors, discovering lagged (temporal) dependencies, and improving forecasting accuracy for a given metric. We demonstrate the viability of our approach by comparing its results in the three use cases with those of previously proposed methods, such as Granger causality, and of classical statistical time series models, such as ARIMA and Holt-Winters. For our experiments and analysis, we use performance monitoring data from two sources: a controlled experiment with a sample cloud application that we deployed on a public cloud infrastructure, and data collected from the monitoring service of an operational public cloud service provider.
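
To make the notion of a lagged dependency between two performance metrics concrete, the following is a minimal sketch of a simple linear baseline: it scans candidate lags and reports the one with the strongest Pearson correlation. This is not the LSTM-based technique proposed in the paper, nor Granger causality; it is only an illustration. The metric names (`request_rate`, `latency`), the synthetic data, and the helper functions `pearson` and `best_lag` are all assumptions introduced for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def best_lag(driver, target, max_lag):
    """Return (lag, corr) maximizing |corr(driver[t], target[t + lag])|."""
    scores = {}
    for lag in range(max_lag + 1):
        n = len(driver) - lag
        # Pair driver[t] with target[t + lag] over the overlapping window.
        scores[lag] = pearson(driver[:n], target[lag:])
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]


# Synthetic, hypothetical metrics: latency follows request rate 3 steps later.
rng = random.Random(0)
request_rate = [rng.gauss(0.0, 1.0) for _ in range(500)]
latency = [0.0] * 3 + [0.8 * r for r in request_rate[:-3]]
latency = [v + rng.gauss(0.0, 0.1) for v in latency]

lag, corr = best_lag(request_rate, latency, max_lag=10)
print(lag, round(corr, 3))
```

A linear scan of this kind can only detect linear, pairwise relationships, which is one reason the abstract motivates a model that can capture non-linear interactions across many metrics.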
