Grasp the Root Causes in the Data Plane: Diagnosing Latency Problems with SpiderMon

Unexplained performance degradation is one of the most severe problems in data center networks, and the growing scale of these networks makes it even harder to maintain good performance for all users at low cost. Our system, SpiderMon, monitors network performance and debugs performance failures inside the network with little overhead. SpiderMon provides a two-phase solution that runs in the data plane. In the monitoring phase, it keeps track of the performance of every flow in the network; upon detecting a performance problem, it triggers a debugging phase in which a causality analyzer identifies the root cause of the degradation. To implement these two phases, SpiderMon exploits the capabilities of high-speed programmable switches (e.g., per-packet monitoring, stateful memory). We prototype SpiderMon using the BMv2 model of P4, and our preliminary evaluation shows that SpiderMon quickly finds the root cause of performance degradation problems with minimal overhead: it incurs nearly zero overhead during the monitoring phase and efficiently collects the relevant data from switches during the debugging phase.
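
The two-phase idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not SpiderMon's actual data-plane logic: the names `Packet` and `ToySwitch`, the fixed delay threshold, the bounded packet history, and the "rank flows by bytes queued while the victim waited" heuristic are all hypothetical choices made for illustration. The sketch shows a cheap per-packet check in the monitoring phase, with the more expensive analysis run only when that check fires.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Packet:
    flow_id: str
    size_bytes: int
    arrival_us: int      # time the packet entered the switch queue
    departure_us: int    # time the packet left the switch queue


class ToySwitch:
    """Toy two-phase monitor (hypothetical; not the SpiderMon implementation)."""

    def __init__(self, delay_threshold_us: int, history_len: int = 1024):
        # Assumed trigger: a fixed queuing-delay threshold in microseconds.
        self.delay_threshold_us = delay_threshold_us
        # Bounded record of recent packets, standing in for switch stateful memory.
        self.history = deque(maxlen=history_len)

    def on_packet(self, pkt: Packet):
        """Monitoring phase: record the packet and check its queuing delay."""
        self.history.append(pkt)
        delay = pkt.departure_us - pkt.arrival_us
        if delay > self.delay_threshold_us:
            return self.debug(pkt)
        return None

    def debug(self, victim: Packet):
        """Debugging phase: rank other flows by bytes queued while the victim waited."""
        contributed = defaultdict(int)
        for p in self.history:
            overlaps = (p.arrival_us <= victim.departure_us
                        and p.departure_us >= victim.arrival_us)
            if overlaps and p.flow_id != victim.flow_id:
                contributed[p.flow_id] += p.size_bytes
        # Largest contributor first.
        return sorted(contributed.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sw = ToySwitch(delay_threshold_us=50)
    sw.on_packet(Packet("A", 1500, 0, 10))
    sw.on_packet(Packet("B", 9000, 5, 40))
    sw.on_packet(Packet("C", 1500, 20, 45))
    suspects = sw.on_packet(Packet("V", 100, 30, 100))
    print(suspects)
```

In this toy run, only the last packet exceeds the threshold, so the debugging step runs once and ranks the flows whose packets overlapped with the victim's time in the queue.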
