DOVE: Diagnosis-driven SLO Violation Detection

Service-level objectives (SLOs), typically expressed as network performance requirements on delay and packet loss, must be guaranteed for a growing number of high-performance applications such as telesurgery and cloud gaming. However, SLO violations are common and destructive in today's network operation. Detection (monitoring performance to discover anomalies) and diagnosis (analyzing the causes of SLO violations) are both crucial for fast recovery. Unfortunately, existing diagnosis approaches require exhaustive causal information to function, while existing detection tools either incur large overhead or provide only limited information for diagnosis. This paper presents DOVE, a diagnosis-driven SLO violation detection system with high accuracy and low overhead. The key idea is to selectively and efficiently identify and report, from the data plane, the information that diagnosis needs together with the SLO violation alerts. Network segmentation is introduced to balance scalability and accuracy. Novel algorithms for measuring packet loss and percentile delay are implemented entirely on the data plane, without involving the control plane, to enable fine-grained SLO violation detection. We implement and deploy DOVE on Tofino and on the P4 software switch (BMv2), and demonstrate its effectiveness with a use case. Compared with ground truth, the reported SLO violation alerts and the information reported for diagnosis are highly accurate (>97%). Our evaluation also shows that DOVE introduces up to two orders of magnitude less traffic overhead than NetSight. In addition, its memory utilization and required processing capability are low enough for DOVE to be deployable in real network topologies.
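
To make the two measured quantities concrete, the following is a minimal offline sketch of what checking an SLO on packet loss and percentile delay over one measurement window can look like. It is not DOVE's data-plane algorithm, which the abstract does not describe; the names `Slo`, `nearest_rank_percentile`, and `check_window`, the nearest-rank percentile definition, and the example thresholds are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical SLO: a delay percentile bound and a loss-rate bound."""
    percentile: float      # e.g. 99.0 for the 99th percentile
    max_delay_us: float    # bound on the percentile delay, in microseconds
    max_loss_rate: float   # bound on the fraction of lost packets


def nearest_rank_percentile(samples, p):
    """Return the p-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


def check_window(slo, sent, received, delays_us):
    """Return the list of SLO terms violated in one measurement window."""
    violations = []
    loss_rate = (sent - received) / sent if sent else 0.0
    if loss_rate > slo.max_loss_rate:
        violations.append("loss")
    if delays_us:
        delay = nearest_rank_percentile(delays_us, slo.percentile)
        if delay > slo.max_delay_us:
            violations.append("delay")
    return violations


# Assumed example: 99th-percentile delay <= 500 us, loss rate <= 1%.
slo = Slo(percentile=99.0, max_delay_us=500.0, max_loss_rate=0.01)
window_delays = [100.0] * 98 + [900.0] * 2
print(check_window(slo, sent=1000, received=980, delays_us=window_delays))
```

The sketch sorts all delay samples, which is affordable offline but not on a switch pipeline; a data-plane implementation would need an approximate structure for percentiles instead.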
