DEFINED: Deterministic Execution for Interactive Control-Plane Debugging

Large-scale networks are among the most complex software infrastructures in existence. Unfortunately, the extreme complexity of the control-plane software that underpins them leads to a rich variety of nondeterministic failure modes and anomalies. Research on debugging modern control-plane software has focused on designing comprehensive record-and-replay systems, but the large volume of recordings often hinders the scalability of these designs. Here, we argue for a different approach: we take the position that deterministic network execution would vastly simplify the control-plane debugging process. This paper presents the design and implementation of DEFINED, a user-space substrate for interactive debugging that provides deterministic execution of networks in highly distributed and dynamic environments. We demonstrate our system's advantages by reproducing the discovery of known ordering and timing bugs in two popular software routing platforms, XORP and Quagga. Using Rocketfuel topologies and routing data from a Tier-1 backbone, we show that DEFINED is practical and scalable for interactive fault diagnosis in large networks.
