Data Center Diagnostics with Network Provenance

Diagnosing problems in data centers has always been challenging because of their complexity and heterogeneity. Among recent proposals for addressing this challenge, one promising approach leverages provenance, which provides the fundamental functionality needed for fault diagnosis and debugging: a way to track direct and indirect causal relationships between system states and their changes. This information is valuable because it lets system operators tie the observed symptoms of a fault to its potential root causes. However, capturing provenance in a data center is difficult because, at high data rates, it would impose a substantial cost. In this paper, we introduce techniques that can help: we show how to reduce the cost of maintaining provenance by leveraging structural similarities for compression and by offloading expensive but highly parallel operations to hardware. We also discuss our progress towards transforming provenance into compact, actionable diagnostic decisions for repairing problems caused by misconfigurations and program bugs.
