Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures

Server applications running inside production cloud infrastructures are prone to various performance problems (e.g., software hang, performance slowdown). When those problems occur, developers often have little clue to diagnose those problems. In this paper, we present Hytrace, a novel hybrid approach to diagnosing performance problems in production cloud infrastructures. Hytrace combines rule-based static analysis and runtime inference techniques to achieve higher bug localization accuracy than pure-static and pure-dynamic approaches for performance bugs. Hytrace does not require source code and can be applied to both compiled and interpreted programs such as C/C++ and Java. We conduct experiments using real performance bugs from seven commonly used server applications in production cloud infrastructures. The results show that our approach can significantly improve the performance bug diagnosis accuracy compared to existing diagnosis techniques.

[1]  Vanish Talwar,et al.  Monalytics: online monitoring and analytics for managing large scale data centers , 2010, ICAC '10.

[2]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[3]  Naren Ramakrishnan,et al.  Efficient Episode Mining of Dynamic Event Streams , 2012, 2012 IEEE 12th International Conference on Data Mining.

[4]  Mona Attariyan,et al.  X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software , 2012, OSDI.

[5]  Shan Lu,et al.  Production-run software failure diagnosis via hardware performance counters , 2013, ASPLOS '13.

[6]  Xiaohui Gu,et al.  Ieee Transactions on Parallel and Distributed Systems (tpds) Perfcompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-service Clouds , 2022 .

[7]  Helen J. Wang,et al.  Strider: a black-box, state-based approach to change and configuration management and support , 2003, Sci. Comput. Program..

[8]  Abhishek Kumar,et al.  Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems , 2008, OSDI.

[9]  Shan Lu,et al.  Pcatch: automatically detecting performance cascading bugs in cloud systems , 2018, EuroSys.

[10]  Shan Lu,et al.  CARAMEL: Detecting and Fixing Performance Problems That Have Non-Intrusive Fixes , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[11]  Xiaohui Gu,et al.  PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.

[12]  George Varghese,et al.  Gestalt: Fast, Unified Fault Localization for Networked Systems , 2014, USENIX Annual Technical Conference.

[13]  Rajeev Gandhi,et al.  Black-Box Problem Diagnosis in Parallel File Systems , 2010, FAST.

[14]  Shan Lu,et al.  Understanding and detecting real-world performance bugs , 2012, PLDI.

[15]  Sarfraz Khurshid,et al.  An empirical study of long lived bugs , 2014, 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).

[16]  Xiaohui Gu,et al.  UBL: unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems , 2012, ICAC '12.

[17]  Shan Lu,et al.  Statistical debugging for real-world performance problems , 2014, OOPSLA.

[18]  Dongmei Zhang,et al.  Performance debugging in the large via mining millions of stack traces , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[19]  Shan Lu,et al.  Automated atomicity-violation fixing , 2011, PLDI '11.

[20]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[21]  Xiaohui Gu,et al.  Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures , 2014, USENIX Annual Technical Conference.

[22]  Shan Lu,et al.  Toddler: Detecting performance problems via similar memory-access patterns , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[23]  Xiaohui Gu,et al.  PerfScope: Practical Online Server Performance Bug Inference in Production Cloud Computing Infrastructures , 2014, SoCC.

[24]  Xiao Yu,et al.  CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs , 2016, ASPLOS.

[25]  Xiangyu Zhang,et al.  IntroPerf: transparent context-sensitive multi-layer performance inference using system stack traces , 2014, SIGMETRICS '14.

[26]  Ahmed E. Hassan,et al.  Detecting performance anti-patterns for applications developed using object-relational mapping , 2014, ICSE.

[27]  Vipin Kumar,et al.  Introduction to Data Mining , 2022, Data Mining and Machine Learning Applications.

[28]  Trishul M. Chilimbi,et al.  HOLMES: Effective statistical debugging via efficient path profiling , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[29]  Dongmei Zhang,et al.  Context-sensitive delta inference for identifying workload-dependent performance bottlenecks , 2013, ISSTA.

[30]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[31]  Wei Zhang,et al.  Automated Concurrency-Bug Fixing , 2012, OSDI.

[32]  Xin Li,et al.  Reference-driven performance anomaly identification , 2009, SIGMETRICS '09.

[33]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .