Packet-Level Telemetry in Large Datacenter Networks

Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. Datacenter networks (DCNs) make this task uniquely challenging because of their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets within a large volume of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, no tool today can achieve both the specificity and scale required for this task. We present Everflow, a packet-level network telemetry system for large DCNs. Everflow traces specific packets by implementing a powerful packet filter on top of the "match and mirror" functionality of commodity switches. It shuffles captured packets to multiple analysis servers using load balancers built on switch ASICs, and it sends "guided probes" to test or confirm potential faults. We present experiments that demonstrate Everflow's scalability, and share experiences of troubleshooting network faults gathered from running it for over 6 months in Microsoft's DCNs.
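
As a rough illustration of the "match and mirror" and shuffling steps described above, the following minimal Python sketch models them in software. It is not Everflow's implementation, which runs on switch ASICs. The `Packet` fields, the rules in `should_mirror` (TCP SYN/FIN/RST flags or a debug bit), and the CRC32 hash over the 5-tuple in `pick_analyzer` are all assumptions chosen for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified packet header used only for this sketch."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    tcp_flags: frozenset = frozenset()
    debug_bit: bool = False


def should_mirror(pkt: Packet) -> bool:
    """Assumed match rules: mirror packets that carry a debug bit or a
    TCP SYN/FIN/RST flag. Real rules would live in the switch's match table."""
    return pkt.debug_bit or bool(pkt.tcp_flags & {"SYN", "FIN", "RST"})


def pick_analyzer(pkt: Packet, analyzers: list) -> str:
    """Assumed shuffling policy: hash the 5-tuple so that every mirrored
    packet of a flow, from any switch, reaches the same analysis server."""
    key = f"{pkt.src_ip}|{pkt.dst_ip}|{pkt.src_port}|{pkt.dst_port}|{pkt.proto}"
    return analyzers[zlib.crc32(key.encode()) % len(analyzers)]
```

Hashing on header fields rather than on the mirroring switch means copies of the same packet captured at different hops land on one analyzer, which is what makes it possible to track a packet across multiple network components.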