AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems

Service reliability is one of the key challenges that cloud providers have to deal with. In cloud systems, unplanned service failures may cause severe cascading impacts on dependent services, degrading customer satisfaction. Predicting these cascading impacts accurately and efficiently is critical to the operation and maintenance of cloud systems. Existing approaches use distributed tracing to identify whether one service depends on another, but no prior work has focused on quantifying how strongly one cloud service depends on another. In this paper, we survey the outages and the failure diagnosis procedures of two cloud providers to motivate the definition of the intensity of dependency. We define the intensity of dependency between two services as how much the status of the callee service influences the caller service. We then propose AID, the first approach to predict the intensity of dependencies between cloud services. AID first generates a set of candidate dependency pairs from the spans. It then represents the status of each cloud service as a multivariate time series aggregated from the spans. Using this representation, AID calculates the similarities between the statuses of the caller and the callee of each candidate pair. Finally, AID aggregates the similarities into a single value, the intensity of the dependency. We evaluate AID on data collected from an open-source microservice benchmark and from a production cloud system. The experimental results show that AID can efficiently and accurately predict the intensity of dependencies. We further demonstrate the usefulness of our method in a large-scale commercial cloud system.
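The four steps described above (candidate pairs, status time series, per-indicator similarity, aggregation) can be illustrated with a minimal sketch. It is not the authors' implementation. The span fields (`caller`, `callee`, `ts`, `duration`, `error`), the three status indicators, the use of dynamic time warping as the similarity measure, the mapping `1 / (1 + distance)`, and the mean as the aggregation are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical span records: who called whom, when, how long it took, and
# whether it failed. Real tracing data would carry many more fields.
EXAMPLE_SPANS = [
    {"caller": None, "callee": "gateway", "ts": 0.5, "duration": 10.0, "error": False},
    {"caller": "gateway", "callee": "orders", "ts": 0.6, "duration": 5.0, "error": False},
    {"caller": None, "callee": "gateway", "ts": 1.5, "duration": 30.0, "error": True},
    {"caller": "gateway", "callee": "orders", "ts": 1.6, "duration": 20.0, "error": True},
]


def candidate_pairs(spans):
    """Step 1: every (caller, callee) pair observed in the spans."""
    return sorted({(s["caller"], s["callee"]) for s in spans if s["caller"] is not None})


def status_series(spans, service, bin_width, n_bins):
    """Step 2: per-bin indicators for the spans served by `service`."""
    counts, durations, errors = [0] * n_bins, [0.0] * n_bins, [0] * n_bins
    for s in spans:
        if s["callee"] != service:
            continue
        b = int(s["ts"] // bin_width)
        if 0 <= b < n_bins:
            counts[b] += 1
            durations[b] += s["duration"]
            errors[b] += int(s["error"])
    return {
        "count": [float(c) for c in counts],
        "duration": [d / c if c else 0.0 for d, c in zip(durations, counts)],
        "errors": [float(e) for e in errors],
    }


def normalize(xs):
    """Min-max scale to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def similarity(a, b):
    """Step 3 (assumed form): map a DTW distance to a score in (0, 1]."""
    return 1.0 / (1.0 + dtw_distance(normalize(a), normalize(b)))


def intensity(spans, caller, callee, bin_width, n_bins):
    """Step 4 (assumed form): average the per-indicator similarities."""
    caller_status = status_series(spans, caller, bin_width, n_bins)
    callee_status = status_series(spans, callee, bin_width, n_bins)
    return mean(similarity(caller_status[k], callee_status[k]) for k in caller_status)


if __name__ == "__main__":
    for caller, callee in candidate_pairs(EXAMPLE_SPANS):
        print(caller, callee, intensity(EXAMPLE_SPANS, caller, callee, 1.0, 2))
```

In this sketch, a pair whose caller and callee indicators rise and fall together scores close to 1, while a pair whose indicators evolve independently scores lower. A production setting would need a bounded or approximate similarity computation to remain efficient at scale.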
