Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures

Server applications running inside production cloud infrastructures are prone to various performance problems (e.g., software hangs and performance slowdowns). When such problems occur, developers often have few clues for diagnosing them. In this paper, we present Hytrace, a novel hybrid approach to diagnosing performance problems in production cloud infrastructures. Hytrace combines rule-based static analysis with runtime inference techniques to localize performance bugs more accurately than pure-static or pure-dynamic approaches. Hytrace does not require source code and can be applied to both compiled and interpreted programs, such as those written in C/C++ and Java. We conduct experiments using real performance bugs from seven commonly used server applications in production cloud infrastructures. The results show that our approach significantly improves performance bug diagnosis accuracy compared to existing diagnosis techniques.

[35]  Shan Lu,et al.  Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures , 2019, IEEE Trans. Parallel Distributed Syst..
