Confluo: Distributed Monitoring and Diagnosis Stack for High-speed Networks

Confluo is an end-host stack that can be integrated with existing network management tools to enable monitoring and diagnosis of network-wide events using telemetry data distributed across end-hosts, even for high-speed networks. Confluo achieves this using a new data structure, the Atomic MultiLog, that supports highly concurrent read-write operations by exploiting two properties specific to telemetry data: (1) once processed by the stack, the data is neither updated nor deleted; and (2) each field in the data has a fixed, pre-defined size. Our evaluation results show that, for packet sizes of 128 B or larger, Confluo executes thousands of triggers and tens of filters at line rate (for 10 Gbps links) using a single core.
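
To illustrate why the two properties above help with concurrency, here is a minimal sketch of an append-only log of fixed-size records. It is not Confluo's Atomic MultiLog or its API: the record layout, the class name `AppendOnlyLog`, and its methods are hypothetical, and a `threading.Condition` stands in for the atomic instructions a real implementation would likely use. Because records are never updated or deleted, a writer only has to reserve the next slot and publish it; because every field has a fixed size, a record's byte offset is simply its index times the record size.

```python
import struct
import threading

# Hypothetical fixed-size record layout (an assumption for illustration):
# timestamp (8 B), source IP (4 B), destination IP (4 B), packet length (2 B).
RECORD = struct.Struct("<QIIH")


class AppendOnlyLog:
    """Illustrative append-only log of fixed-size records (not Confluo's code)."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity * RECORD.size)
        self._capacity = capacity
        self._write_tail = 0  # next slot to hand out to a writer
        self._read_tail = 0   # records below this index are fully written
        # Stand-in for atomic fetch-and-add / compare-and-swap on the tails.
        self._cond = threading.Condition()

    def append(self, ts, src, dst, length):
        # 1. Reserve a slot by advancing the write tail.
        with self._cond:
            if self._write_tail == self._capacity:
                raise MemoryError("log is full")
            idx = self._write_tail
            self._write_tail += 1
        # 2. Fill the slot outside the critical section. Slots are disjoint and
        #    fixed-size, so the byte offset is plain arithmetic.
        RECORD.pack_into(self._buf, idx * RECORD.size, ts, src, dst, length)
        # 3. Publish in order: advance the read tail once all earlier slots
        #    have been published, so readers never see a half-written record.
        with self._cond:
            while self._read_tail != idx:
                self._cond.wait()
            self._read_tail = idx + 1
            self._cond.notify_all()
        return idx

    def __len__(self):
        return self._read_tail

    def get(self, idx):
        # Published records are immutable, so reads need no lock.
        if not 0 <= idx < self._read_tail:
            raise IndexError(idx)
        return RECORD.unpack_from(self._buf, idx * RECORD.size)
```

In this sketch the only contended state is the pair of tail counters; all record writes and reads touch disjoint or immutable memory, which is the kind of simplification the two telemetry properties make possible.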
