Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis

Loosely coupled, lightweight microservices running in containers are gradually replacing monolithic applications. Understanding the characteristics of microservices is critical to making good use of microservice architectures. However, there has so far been no comprehensive study of microservices and their related systems in production environments. In this paper, we present an analysis of large-scale microservice deployments in Alibaba clusters. Our study focuses on characterizing microservice dependencies and runtime performance. We dissect microservice call graphs in depth to quantify how they differ from the traditional DAGs of data-parallel jobs. In particular, we observe that microservice call graphs are heavy-tail distributed, that their topology is similar to a tree, and that many microservices are hot-spots. We reveal three types of meaningful call dependency that can be used to optimize microservice designs. Our investigation of microservice runtime performance indicates that most microservices are much more sensitive to CPU interference than to memory interference. To synthesize more representative microservice traces, we build a mathematical model to simulate call graphs. Experimental results demonstrate that our model preserves well the graph properties observed in the Alibaba traces.
