CURRF: A Code-Based Framework for Faithful Replay Distributed Applications

Debugging distributed systems is a programmer's nightmare because of non-deterministic bugs. Such non-repeatable bugs force developers back to outdated and time-consuming techniques such as printf debugging and log mining. To relieve this problem, record and replay mechanisms have been proposed. These methods allow developers to apply their cyclic debugging skills in non-deterministic situations. In this paper, we present the design and implementation of CURRF: a Code-based, fully User-space, lightweight Record and Replay Framework. With CURRF, developers can easily replay individual processes of a large-scale distributed system without touching other components. It achieves this goal by introducing a code annotation mechanism: programmers write annotations in the source code to notify the logger of critical non-deterministic operations. CURRF is much more flexible and easier to use than previous record and replay solutions such as R2. We demonstrate the efficiency and usefulness of CURRF for new applications as well as legacy programs. The experimental results show that CURRF introduces very little interference when the program is running in debugging mode.
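
The annotation idea can be illustrated with a minimal sketch. It is not CURRF's actual interface, which the abstract does not specify: the `ReplayLog` class, the `nondeterministic` decorator, and the `run_process` function are hypothetical names chosen for illustration, and Python is used only for brevity. The sketch assumes that each annotated operation returns a JSON-serializable value and that a single process issues its annotated calls in the same order during replay.

```python
import functools
import json
import random
import time


class ReplayLog:
    """Hypothetical logger: records results of annotated calls, or replays them."""

    def __init__(self, mode="record", entries=None):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.mode = mode
        self.entries = list(entries or [])
        self._cursor = 0  # position of the next entry to replay

    def nondeterministic(self, func):
        """Annotation that marks func as a source of non-determinism."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if self.mode == "record":
                # Run the real operation and log its result.
                value = func(*args, **kwargs)
                self.entries.append({"op": func.__name__, "value": value})
                return value
            # Replay mode: skip the real operation and return the logged result.
            if self._cursor >= len(self.entries):
                raise RuntimeError("replay log exhausted")
            entry = self.entries[self._cursor]
            if entry["op"] != func.__name__:
                raise RuntimeError(
                    f"replay diverged: expected {entry['op']}, got {func.__name__}"
                )
            self._cursor += 1
            return entry["value"]
        return wrapper

    def dump(self):
        """Serialize the log so it can be stored and replayed later."""
        return json.dumps(self.entries)


def run_process(log):
    """A toy 'process' whose only non-determinism goes through annotations."""
    @log.nondeterministic
    def read_clock():
        return time.time()

    @log.nondeterministic
    def pick_peer():
        return random.randint(0, 9)

    return [read_clock(), pick_peer(), pick_peer()]


# Record one run, then replay it in isolation from the stored log.
recorded = ReplayLog("record")
original = run_process(recorded)
replayed = run_process(ReplayLog("replay", json.loads(recorded.dump())))
assert replayed == original
```

Because only the annotated calls touch the log, the replayed process needs nothing from its peers: the logged values stand in for the clock, the random source, or, in a distributed setting, incoming messages.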
