OmniMon: Re-architecting Network Telemetry with Resource Efficiency and Full Accuracy

Network telemetry is essential for administrators to monitor massive data traffic in a network-wide manner. Existing telemetry solutions often face a dilemma between resource efficiency (i.e., low CPU, memory, and bandwidth overhead) and full accuracy (i.e., error-free and holistic measurement). We break this dilemma with OmniMon, a network-wide architectural design that simultaneously achieves resource efficiency and full accuracy in flow-level telemetry for large-scale data centers. OmniMon carefully coordinates the different types of entities across the network as they execute telemetry operations, so that the resource constraints of each entity are satisfied without compromising full accuracy. It further addresses consistency in network-wide epoch synchronization and accountability in error-free packet loss inference. We prototype OmniMon in DPDK and P4. Testbed experiments on commodity servers and Tofino switches demonstrate the effectiveness of OmniMon over state-of-the-art telemetry designs.
