Sieve: Actionable Insights from Monitored Metrics in Microservices

Major cloud computing operators provide powerful monitoring tools to understand the current (and prior) state of the distributed systems deployed in their infrastructure. While such tools provide detailed monitoring at scale, they also pose a significant challenge for application developers and operators, who must transform the huge space of monitored metrics into useful insights. These insights are essential for building effective management tools that improve the efficiency, resiliency, and dependability of distributed systems. This paper reports on our experience with building and deploying Sieve, a platform to derive actionable insights from monitored metrics in distributed systems. Sieve builds on two core components: a metrics reduction framework and a metrics dependency extractor. More specifically, Sieve first reduces the dimensionality of the metrics by observing their signal over time and automatically filtering out unimportant ones. Afterwards, Sieve infers metric dependencies between distributed components of the system using a predictive-causality model, testing for Granger causality. We implemented Sieve as a generic platform and deployed it for two microservices-based distributed systems: OpenStack and ShareLatex. Our experience shows that (1) Sieve can reduce the number of metrics by at least an order of magnitude (10-100$\times$) while remaining statistically equivalent to the full set of monitored metrics; (2) Sieve can dramatically improve existing monitoring infrastructures by reducing the associated overheads over the entire system stack (CPU: 80%, storage: 90%, and network: 50%); and (3) Sieve can effectively support a wide range of workflows in distributed systems. We showcase two such workflows: orchestration of autoscaling and Root Cause Analysis (RCA).
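The two steps described above can be illustrated with a minimal sketch. It is not Sieve's implementation: the variance filter is a simplified stand-in for the metrics reduction step, the dependency step is a textbook lag-1 Granger F-test, and every name here (`drop_flat_metrics`, `granger_f_stat`, the synthetic metrics `x`, `y`, and `flat`) is hypothetical and chosen only for illustration.

```python
import random
import statistics


def drop_flat_metrics(metrics, min_variance=1e-6):
    """Keep only metrics whose variance over time exceeds min_variance.

    Assumption: a near-constant signal carries no useful information.
    """
    return {name: series for name, series in metrics.items()
            if statistics.pvariance(series) > min_variance}


def _ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f_stat(x, y):
    """Lag-1 F statistic for the hypothesis 'x Granger-causes y'.

    Restricted model:   y[t] = a + b*y[t-1]
    Unrestricted model: y[t] = a + b*y[t-1] + c*x[t-1]
    A large value means past x improves the prediction of y.
    """
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = _ols_rss(restricted, targets)
    rss_u = _ols_rss(unrestricted, targets)
    dof = len(targets) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data: y follows x with a one-step delay; 'flat' never changes.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0.0, 0.1) for t in range(1, 200)]
metrics = {"x": x, "y": y, "flat": [5.0] * 200}

kept = drop_flat_metrics(metrics)
f_xy = granger_f_stat(kept["x"], kept["y"])  # x -> y
f_yx = granger_f_stat(kept["y"], kept["x"])  # y -> x
print(sorted(kept), f_xy > f_yx)
```

In this synthetic setup the constant metric is dropped and the statistic for `x -> y` far exceeds the one for `y -> x`, which is the asymmetry a dependency extractor would read as a directed edge. A real deployment would compare the statistic against an F-distribution critical value and would need to consider more than one lag.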
