IntSight: diagnosing SLO violations with in-band network telemetry

Performance requirements for many of today's high-performance networks are expressed as service-level objectives (SLOs), i.e., precise guarantees, typically on latency and bandwidth, that a user can expect from the network. For network operators, monitoring SLO compliance and quickly diagnosing violations are critical to effective operations. Unfortunately, existing network architectures are not engineered for this purpose; there is no mechanism, for example, for an operator to monitor the 95th percentile latency experienced by a customer. Data plane programmability has made per-packet measurements possible, but it brings the challenge of keeping monitoring overhead low enough to be practical. In this paper, we present IntSight, a system for highly accurate and fine-grained detection and diagnosis of SLO violations. Building upon in-band telemetry, IntSight's main contribution is the path-wise computation of network metrics and the selective generation of reports. We demonstrate the effectiveness of IntSight through two use cases. Our evaluation using real networks also shows that IntSight generates up to two orders of magnitude less monitoring traffic than state-of-the-art approaches. Furthermore, its processing and memory requirements are low and therefore compatible with existing programmable platforms.
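To make the two ideas named above more concrete, the following is a minimal Python sketch of path-wise metric computation with selective report generation. It is an illustration under stated assumptions, not IntSight's actual design: the `Telemetry` header, the `Slo` record, the single latency metric, and the function names are all hypothetical. Each hop adds its delay to a field carried in the packet, and only the last hop decides whether a report is worth sending.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Telemetry:
    """Hypothetical in-band header carried by a packet along its path."""
    path_delay_us: int = 0  # delay accumulated over the hops seen so far
    hops: int = 0           # number of hops that updated the header


@dataclass
class Slo:
    """Hypothetical SLO: an upper bound on end-to-end delay."""
    max_delay_us: int


def on_hop(hdr: Telemetry, hop_delay_us: int) -> None:
    # Path-wise computation: fold this hop's delay into the carried
    # value instead of exporting a separate per-hop record.
    hdr.path_delay_us += hop_delay_us
    hdr.hops += 1


def on_egress(hdr: Telemetry, slo: Slo) -> Optional[dict]:
    # Selective reporting: emit a report only when the SLO is violated.
    if hdr.path_delay_us > slo.max_delay_us:
        return {"path_delay_us": hdr.path_delay_us, "hops": hdr.hops}
    return None


def traverse(hop_delays_us: list, slo: Slo) -> Optional[dict]:
    """Simulate one packet crossing a path with the given per-hop delays."""
    hdr = Telemetry()
    for delay in hop_delays_us:
        on_hop(hdr, delay)
    return on_egress(hdr, slo)
```

In this sketch, a packet that stays within the bound produces no monitoring traffic at all, which is the intuition behind reducing report volume; how the real system aggregates measurements and decides when to report is not specified in the abstract.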
