PINT: Probabilistic In-band Network Telemetry

Commodity network devices support adding in-band telemetry measurements into data packets, enabling a wide range of applications, including network troubleshooting, congestion control, and path tracing. However, including such information in packets adds significant overhead that impacts both flow completion times and application-level performance. We introduce PINT, an in-band network telemetry framework that bounds the amount of information added to each packet. PINT encodes the requested data on multiple packets, allowing per-packet overhead limits that can be as low as one bit. We analyze PINT and prove performance bounds, including cases when multiple queries are running simultaneously. PINT is implemented in P4 and can be deployed on network devices. Using real topologies and traffic characteristics, we show that PINT concurrently enables applications such as congestion control, path tracing, and computing tail latencies, using only sixteen bits per packet, with performance comparable to the state of the art.
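The idea of spreading requested data across multiple packets can be illustrated with a small simulation. The sketch below is not PINT's actual encoding; it assumes a simple reservoir-sampling scheme in which each packet carries a single value slot, hop i overwrites the slot with probability 1/i, and the receiver recomputes a shared hash to learn which hop wrote each packet. All names (`_unit_hash`, `writer_hop`, `send`, `collect`) are hypothetical, and the receiver is assumed to know the path length.

```python
import hashlib


def _unit_hash(packet_id: int, hop: int) -> float:
    """Deterministic pseudo-random number in [0, 1), computable by switches and receiver."""
    digest = hashlib.sha256(f"{packet_id}:{hop}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def writer_hop(packet_id: int, path_len: int) -> int:
    """Return the 1-based hop whose value survives in the packet.

    Hop i overwrites the slot when its hash falls below 1/i (reservoir sampling),
    so the last hop that overwrites is the one the receiver sees.
    """
    winner = 1
    for hop in range(1, path_len + 1):
        if _unit_hash(packet_id, hop) < 1.0 / hop:
            winner = hop
    return winner


def send(packet_id: int, hop_values: list) -> int:
    """Simulate one packet crossing the path; return the single value it carries."""
    slot = None
    for hop, value in enumerate(hop_values, start=1):
        if _unit_hash(packet_id, hop) < 1.0 / hop:  # hop 1 always writes
            slot = value
    return slot


def collect(hop_values: list, max_packets: int = 10_000):
    """Receiver side: gather one value per hop from successive packets."""
    path_len = len(hop_values)
    decoded = {}
    for packet_id in range(max_packets):
        slot = send(packet_id, hop_values)
        decoded[writer_hop(packet_id, path_len)] = slot
        if len(decoded) == path_len:
            ordered = [decoded[h] for h in range(1, path_len + 1)]
            return ordered, packet_id + 1
    raise RuntimeError("not every hop was observed within max_packets")


if __name__ == "__main__":
    values, packets_used = collect([7, 3, 9, 4, 1])
    print(values, "recovered after", packets_used, "packets")
```

Each packet carries only one value regardless of path length, at the cost of needing several packets before every hop has been observed. How many packets that takes is a coupon-collector question, which this sketch does not analyze.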
