Fixed It For You: Protocol Repair Using Lineage Graphs

Distributed systems are difficult to program and nearly impossible to debug. Existing tools that focus on single-node computation are poorly suited to diagnosing errors that involve the interaction of many machines over time. The database notion of provenance would appear to be a better fit for answering the sort of cause-and-effect questions that arise during debugging, but existing provenance-based approaches target only a narrow set of debugging scenarios. In this paper, we explore the limits of provenance-based debugging. We propose a simple query language to express common debugging questions as expressions over provenance graphs capturing traces of distributed executions. When programs and their correctness properties are written in the same high-level declarative language, we can often go a step further than highlighting errors and generate repairs for distributed programs. We validate our prototype debugger, Nemo, on six protocols from our taxonomy of 52 real-world distributed bugs, either generating repair rules or pointing the programmer to root causes.
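
To convey the intuition behind querying a lineage graph, here is a minimal sketch in Python. It is not Nemo's query language or data model (the paper's programs, properties, and queries are written in a declarative Datalog-like language); the graph structure, event names, and the why() traversal below are illustrative assumptions only.

```python
# Hypothetical illustration, NOT Nemo's actual syntax: a toy lineage graph
# recorded from a distributed trace, plus a "why did this event occur?" query
# that walks backwards from an observed event to its causal history.
from collections import defaultdict

class LineageGraph:
    """Directed lineage graph: each effect maps to the events that directly caused it."""

    def __init__(self):
        self.causes = defaultdict(set)  # effect -> set of direct causes

    def derive(self, effect, *causes):
        """Record that `effect` was derived from the given causes during execution."""
        self.causes[effect].update(causes)

    def why(self, event):
        """Return the full causal history of `event` (transitive closure over causes)."""
        seen, stack = set(), [event]
        while stack:
            e = stack.pop()
            for cause in self.causes[e]:
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen

# Example trace fragment from a simplified primary/backup write (hypothetical events).
g = LineageGraph()
g.derive("ack@client", "write_ok@primary")
g.derive("write_ok@primary", "replicated@backup", "log@primary")
g.derive("replicated@backup", "msg:replicate(primary->backup)")

# "Why did the client see an ack?" -- the causal slice a debugger would inspect.
print(g.why("ack@client"))
```

The sketch only shows the graph-traversal idea behind a "why" question; the ability to go further and synthesize repair rules relies on the programs and their correctness properties sharing one declarative language, which this toy example does not model.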
