TFix: Automatic Timeout Bug Fixing in Production Server Systems

Timeouts are widely used to handle unexpected failures in distributed systems. However, improper use of timeout schemes can cause serious availability and performance issues, which are often difficult to fix due to a lack of diagnostic information. In this paper, we present TFix, an automatic timeout bug fixing system for correcting misused timeout bugs in production systems. TFix adopts a drill-down bug analysis protocol that narrows down the root cause of a misused timeout bug and produces recommendations for correcting it. TFix first employs a system call frequent episode mining scheme to check whether a timeout bug is caused by a misused timeout variable. TFix then employs application tracing to identify timeout-affected functions. Next, TFix uses taint analysis to localize the misused timeout variable. Finally, TFix recommends proper timeout variable values based on the tracing results from normal runs. We have implemented a prototype of TFix and conducted extensive experiments using 13 real-world server timeout bugs. Our experimental results show that TFix can correctly localize the misused timeout variables and suggest proper timeout values for fixing those bugs.
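The last step, deriving a timeout value from tracing results collected during normal runs, can be illustrated with a minimal sketch. The abstract does not specify how TFix computes its recommendation, so the heuristic below (longest normal-run duration of a timeout-affected function, multiplied by a safety margin) is an assumption made for illustration only. The function name `recommend_timeout` and the `margin` parameter are hypothetical.

```python
from typing import Sequence


def recommend_timeout(normal_durations_s: Sequence[float], margin: float = 1.5) -> float:
    """Suggest a timeout (in seconds) from durations traced during normal runs.

    Assumed heuristic, not TFix's documented method: take the longest
    duration observed for the timeout-affected function when the system
    behaved normally, then scale it by a safety margin.
    """
    if not normal_durations_s:
        raise ValueError("need at least one normal-run duration")
    if margin < 1.0:
        raise ValueError("margin must be >= 1.0")
    return max(normal_durations_s) * margin


if __name__ == "__main__":
    # Hypothetical durations (seconds) of one traced function in normal runs.
    durations = [1.0, 2.0, 4.0]
    print(recommend_timeout(durations))
```

A margin above 1.0 leaves headroom for normal variation, so that the recommended value does not fire on slightly slower but healthy executions. A real system would need to choose this margin from its own workload data.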
