REPT: Reverse Debugging of Failures in Deployed Software

Debugging software failures in deployed systems is important because such failures affect real users and customers. However, it is notoriously hard in practice because developers must rely on limited information such as memory dumps. The execution history is usually unavailable because high-fidelity program tracing is too expensive for deployed systems. In this paper, we present REPT, a practical system that enables reverse debugging of software failures in deployed systems. REPT reconstructs the execution history with high fidelity by combining online lightweight hardware tracing of a program's control flow with offline binary analysis that recovers its data flow. Recovering data values thousands of instructions before the failure is seemingly impossible because of information loss and concurrent execution. REPT tackles these challenges by constructing a partial execution order based on timestamps logged by hardware and by iteratively performing forward and backward execution with error correction. We design and implement REPT, deploy it on Microsoft Windows, and integrate it into WinDbg. We evaluate REPT on 16 real-world bugs and show that it recovers data values accurately (92% on average) and efficiently (in less than 20 seconds) for these bugs. We also show that it enables effective reverse debugging for 14 bugs.
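
To make the idea of iterating forward and backward execution concrete, the following is a toy sketch and not REPT's actual analysis. It assumes a hypothetical three-instruction register machine (`mov_imm`, `add_imm`, `mov`), a known control-flow trace, and a final register state standing in for the memory dump. Memory accesses, concurrency, timestamps, integer wraparound, and error correction are all left out. The names `recover`, `backward_step`, and `forward_step` are illustrative only.

```python
# Toy model: recover earlier register values from a recorded instruction
# trace plus the final state. Every name here is hypothetical.

def _set(state, reg, value):
    """Record a recovered value; return True only if it is new information."""
    if value is None or reg in state:
        return False
    state[reg] = value
    return True


def backward_step(instr, before, after):
    """Infer values that held before `instr` from the state after it."""
    op, dst, arg = instr
    changed = False
    # Registers the instruction does not write keep their value.
    for reg, value in after.items():
        if reg != dst:
            changed |= _set(before, reg, value)
    if op == "add_imm" and dst in after:
        # Addition of a constant is reversible.
        changed |= _set(before, dst, after[dst] - arg)
    elif op == "mov" and dst in after:
        # After `mov dst, src`, dst holds the value src had before.
        changed |= _set(before, arg, after[dst])
    # `mov_imm` overwrites dst, so its old value is lost going backward.
    return changed


def forward_step(instr, before, after):
    """Infer values that hold after `instr` from the state before it."""
    op, dst, arg = instr
    changed = False
    for reg, value in before.items():
        if reg != dst:
            changed |= _set(after, reg, value)
    if op == "mov_imm":
        changed |= _set(after, dst, arg)
    elif op == "add_imm" and dst in before:
        changed |= _set(after, dst, before[dst] + arg)
    elif op == "mov" and arg in before:
        changed |= _set(after, dst, before[arg])
    return changed


def recover(trace, final_state):
    """Alternate backward and forward passes until nothing new is learned."""
    # states[i] is the (partial) register state just before trace[i];
    # states[-1] is the known state at the point of failure.
    states = [{} for _ in trace] + [dict(final_state)]
    changed = True
    while changed:
        changed = False
        for i in reversed(range(len(trace))):
            changed |= backward_step(trace[i], states[i], states[i + 1])
        for i in range(len(trace)):
            changed |= forward_step(trace[i], states[i], states[i + 1])
    return states


trace = [
    ("mov_imm", "rax", 5),    # rax = 5
    ("add_imm", "rax", 3),    # rax = 8
    ("mov", "rbx", "rax"),    # rbx = 8
    ("mov_imm", "rax", 0),    # rax = 0, destroying the old rax
]
states = recover(trace, {"rax": 0, "rbx": 8})
for i, state in enumerate(states):
    print(i, state)
```

In this example the backward pass alone cannot recover `rax` just before the last instruction, because `mov_imm` overwrites it. The following forward pass fills that value in from the earlier state, which the backward pass had recovered through the `mov` into `rbx`. Values that neither pass can derive, such as the initial registers, simply stay unknown.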
