dShark: A General, Easy to Program and Scalable Framework for Analyzing In-network Packet Traces

Distributed, in-network packet capture is still the last resort for diagnosing network problems. Despite recent advances in collecting packet traces scalably, effectively utilizing pervasive packet captures still poses important challenges. Arbitrary combinations of middleboxes which transform packet headers make it challenging to even identify the same packet across multiple hops; packet drops in the collection system create ambiguities that must be handled; the large volume of captures, and their distributed nature, make it hard to do even simple processing; and the one-off and urgent nature of problems tends to generate ad-hoc solutions that are not reusable and do not scale. In this paper we propose dShark to address these challenges. dShark allows intuitive groupings of packets across multiple traces that are robust to header transformations and capture noise, offering simple streaming data abstractions for network operators. Using dShark on real-time packet captures from a major cloud provider, we show that dShark makes it easy to write concise and reusable queries against distributed packet traces that solve many common problems in diagnosing complex networks. Our evaluation shows that dShark can analyze production packet traces with more than 10 Mpps throughput on a commodity server, and has near-linear speedup when scaling out on multiple servers.
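To make the idea of grouping packets across traces concrete, the following is a minimal sketch, not dShark's actual implementation or API. It assumes hypothetical capture records in which the IPv4 ID and TCP sequence number survive middlebox rewriting while the source address and TTL may change. The hop names, the record layout, and the helpers `group_by_packet` and `missing_hops` are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical capture records: (hop, ip_id, tcp_seq, src_ip, ttl).
# Assumption: ip_id and tcp_seq are unchanged by header rewriting (e.g. NAT),
# while src_ip and ttl may differ from hop to hop.
captures = [
    ("tor", 101, 5000, "10.0.0.1", 64),
    ("slb", 101, 5000, "192.0.2.7", 63),
    ("edge", 101, 5000, "192.0.2.7", 62),
    ("tor", 102, 6460, "10.0.0.1", 64),
    ("edge", 102, 6460, "192.0.2.7", 62),  # the "slb" capture was lost
]

# Hypothetical ordered list of capture points a packet should traverse.
EXPECTED_HOPS = ("tor", "slb", "edge")


def group_by_packet(records):
    """Group records from all hops by the fields assumed to be invariant."""
    groups = defaultdict(dict)
    for hop, ip_id, tcp_seq, src_ip, ttl in records:
        groups[(ip_id, tcp_seq)][hop] = (src_ip, ttl)
    return dict(groups)


def missing_hops(group):
    """Return expected hops with no capture: a real drop or capture noise."""
    return [hop for hop in EXPECTED_HOPS if hop not in group]


groups = group_by_packet(captures)
for key, group in sorted(groups.items()):
    print(key, "missing:", missing_hops(group))
```

A missing hop in a group is ambiguous by itself: the packet may have been dropped in the network, or only its capture may have been lost. In this sketch, a later hop that still sees the packet (as `edge` does for the second packet) suggests the gap is capture noise rather than a network drop.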
