NetCruiser: Localize Network Failures by Learning from Latency Data

In modern data center networks (DCNs), network device failures are common and difficult to localize. Our key observation is that latency data reflects and profiles network status, and that this information can be used to address problems such as network failure localization. In this paper, we present NetCruiser, a system that localizes failures by learning from latency data. NetCruiser measures and collects latency data to monitor the status of the whole network and to pinpoint which switch or router has failed. We design a data structure to organize this latency data and, on top of it, build a machine learning model that infers where a failure has occurred. The system thus answers the question of which switch in the network has failed. Our experimental evaluation validates both the efficiency and the effectiveness of our approach. NetCruiser can be applied to both inter-DC and intra-DC networks.
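The abstract does not specify NetCruiser's data structure or its learning model, so the following is only a minimal illustrative sketch of the general idea of localizing a faulty switch from end-to-end latency data. It assumes a hypothetical layout in which each probed host pair maps to the list of switches on its path, and it substitutes a simple threshold-and-vote heuristic for the learned model. All names (`localize`, `paths`, `latencies`, `baseline`, `factor`) and all values are assumptions made for illustration, not taken from the paper.

```python
from collections import defaultdict
from statistics import median


def localize(paths, latencies, baseline, factor=3.0):
    """Rank switches by the share of their probe paths that look slow.

    paths:     {host_pair: [switch, ...]}  hypothetical path map
    latencies: {host_pair: [sample, ...]}  recent latency samples (microseconds)
    baseline:  {host_pair: float}          expected latency per pair
    factor:    a pair is "slow" if its median exceeds factor * baseline
    """
    total = defaultdict(int)  # number of probed paths crossing each switch
    slow = defaultdict(int)   # number of those paths that are slow
    for pair, switches in paths.items():
        is_slow = median(latencies[pair]) > factor * baseline[pair]
        for sw in switches:
            total[sw] += 1
            slow[sw] += is_slow
    scores = {sw: slow[sw] / total[sw] for sw in total}
    # Highest score first; ties are broken by switch name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy topology (invented): three ToR switches and two aggregation switches.
paths = {
    ("h1", "h2"): ["tor1", "agg1", "tor2"],
    ("h1", "h3"): ["tor1", "agg2", "tor3"],
    ("h2", "h3"): ["tor2", "agg2", "tor3"],
    ("h3", "h1"): ["tor3", "agg1", "tor1"],
    ("h3", "h2"): ["tor3", "agg1", "tor2"],
}
baseline = {pair: 100.0 for pair in paths}
# Every path that crosses agg1 is slow; the other paths are healthy.
latencies = {
    pair: [900.0, 1100.0, 1000.0] if "agg1" in switches else [95.0, 105.0, 100.0]
    for pair, switches in paths.items()
}

ranking = localize(paths, latencies, baseline)
print(ranking[0])  # the most suspicious switch and its score
```

In this toy input only `agg1` lies on every slow path, so it receives the highest score, while the ToR switches that also carry healthy paths score lower. A learned model, as described in the abstract, would take the place of the fixed threshold and the voting step.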
