Zero-CPU Collection with Direct Telemetry Access

Programmable switches are driving a massive increase in fine-grained measurements, which puts significant pressure on telemetry collectors that must process reports from many switches. Past research has addressed this problem either by improving the performance of the collectors' software stack or by limiting the amount of data sent from switches. In this paper, we take a different and radical approach: switches insert queryable telemetry data directly into the collectors' memory, bypassing the collectors' CPU and thereby improving collection scalability. We call this method direct telemetry access: switches jointly write telemetry reports into the same memory region at a collector, without coordinating with one another. Our solution, DART, is probabilistic, trading memory redundancy and query success probability for CPU resources at collectors. We prototype DART using commodity hardware such as P4 switches and RDMA NICs and show that it achieves high query success rates with a reasonable memory overhead. For example, DART can collect INT path tracing information on a fat-tree topology without involving the collector's CPU, achieving a 99.9% query success probability while using just 300 bytes per flow.
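To make the trade-off between memory redundancy and query success probability concrete, the following is a minimal sketch of uncoordinated writes into a shared memory region. It is an illustration under stated assumptions, not DART's actual data layout: the class `SharedRegion`, the helper `success_rate`, the hash-derived slot addresses, and the 32-bit key tag are all hypothetical, and a Python list stands in for the RDMA-registered memory at the collector.

```python
import hashlib


def _hash(data: bytes, salt: int, mod: int) -> int:
    """Salted hash of `data`, reduced modulo `mod`."""
    digest = hashlib.blake2b(
        data, digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % mod


class SharedRegion:
    """Hypothetical stand-in for a collector memory region written by switches."""

    def __init__(self, num_slots: int, redundancy: int = 2):
        self.slots = [None] * num_slots  # each slot holds (tag, value) or None
        self.redundancy = redundancy     # assumed number of copies per key

    def _addresses(self, key: bytes):
        # Writers and readers derive the same slots from the key alone,
        # so no coordination between writers is needed.
        return [_hash(key, i, len(self.slots)) for i in range(self.redundancy)]

    def _tag(self, key: bytes) -> int:
        # Short key fingerprint stored next to the value (an assumption).
        return _hash(key, 255, 1 << 32)

    def write(self, key: bytes, value) -> None:
        # Models one-sided writes: blindly overwrite whatever is in the slot.
        tag = self._tag(key)
        for addr in self._addresses(key):
            self.slots[addr] = (tag, value)

    def query(self, key: bytes):
        # Succeeds if any copy survived overwrites by other keys.
        tag = self._tag(key)
        for addr in self._addresses(key):
            slot = self.slots[addr]
            if slot is not None and slot[0] == tag:
                return slot[1]
        return None


def success_rate(num_keys: int, num_slots: int, redundancy: int) -> float:
    """Fraction of keys still recoverable after all writes have landed."""
    region = SharedRegion(num_slots, redundancy)
    keys = [f"flow-{i}".encode() for i in range(num_keys)]
    for key in keys:
        region.write(key, key)
    return sum(region.query(key) == key for key in keys) / num_keys


if __name__ == "__main__":
    for copies in (1, 2, 4):
        print(copies, success_rate(num_keys=1000, num_slots=8000, redundancy=copies))
```

In this toy model a query fails only when every copy of a key has been overwritten by other keys, so running the script with different `redundancy` and `num_slots` values shows how the number of copies and the size of the region together determine how many keys remain queryable.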
