FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Modern user-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect and localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11x.
