Automatic Server Hang Bug Diagnosis: Feasible Reality or Pipe Dream?

Server hang bugs are notoriously difficult to diagnose because they often generate little diagnostic information and are hard to reproduce offline. In this paper, we present a characteristic study of 177 real software hang bugs from 8 common open source server systems (Apache, Lighttpd, MySQL, Squid, HDFS, Hadoop MapReduce, Tomcat, and Cassandra). We identify three major root cause categories: programmer errors, mishandled values, and concurrency issues. We then describe two major problems, false positives and false negatives, that arise when applying existing rule-based bug detection techniques to those bugs.
