deTector: a Topology-aware Monitoring System for Data Center Networks

Troubleshooting network performance issues is challenging, especially in large-scale data center networks. This paper presents deTector, a network monitoring system that detects and localizes network failures (manifested mainly as packet losses) accurately and in near real time while minimizing monitoring overhead. deTector achieves this by tightly coupling detection and localization and by carefully selecting probe paths, so that packet losses can be localized from end-to-end observations alone, without the help of additional tools (e.g., tracert). In particular, we quantify the desirable properties of the probe-path matrix, namely coverage and identifiability, and use an efficient greedy algorithm with a good approximation ratio to select probe paths quickly. We also propose a loss localization method based on loss patterns in a data center network. Our algorithm analysis, experimental evaluation on a Fattree testbed, and supplementary large-scale simulation validate the scalability, feasibility, and effectiveness of deTector.
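To make the two properties of the probe-path matrix concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes each candidate path is given as a set of link names, treats coverage as "every link lies on at least `alpha` selected paths", and treats identifiability in its simplest single-failure form: no two links are crossed by exactly the same set of selected paths. The function names, the `alpha` parameter, and the toy topology are all hypothetical, and the greedy step shown here optimizes coverage only.

```python
from typing import Dict, FrozenSet, List, Set


def select_probe_paths(
    candidates: Dict[str, FrozenSet[str]], links: Set[str], alpha: int = 1
) -> List[str]:
    """Greedily pick paths until every link lies on at least `alpha` of them."""
    need = {link: alpha for link in links}  # remaining coverage demand per link
    remaining = dict(candidates)
    chosen: List[str] = []

    def gain(name: str) -> int:
        # Number of still-under-covered links this path would help with.
        return sum(1 for link in remaining[name] if need.get(link, 0) > 0)

    while any(count > 0 for count in need.values()):
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("candidate paths cannot reach the requested coverage")
        for link in remaining.pop(best):
            if need.get(link, 0) > 0:
                need[link] -= 1
        chosen.append(best)
    return chosen


def is_one_identifiable(
    chosen: List[str], candidates: Dict[str, FrozenSet[str]], links: Set[str]
) -> bool:
    """True if each link is crossed by a non-empty, unique set of chosen paths."""
    signatures = set()
    for link in links:
        signature = frozenset(p for p in chosen if link in candidates[p])
        if not signature or signature in signatures:
            return False
        signatures.add(signature)
    return True


# Hypothetical three-link example (assumption, not taken from the paper).
LINKS = {"a", "b", "c"}
CANDIDATES = {
    "p1": frozenset({"a", "b"}),
    "p2": frozenset({"b", "c"}),
    "p3": frozenset({"a"}),
    "p4": frozenset({"a", "b", "c"}),
}

once = select_probe_paths(CANDIDATES, LINKS, alpha=1)
twice = select_probe_paths(CANDIDATES, LINKS, alpha=2)
print(once, is_one_identifiable(once, CANDIDATES, LINKS))
print(twice, is_one_identifiable(twice, CANDIDATES, LINKS))
```

In this toy case a single path covers all three links, but a loss on that path cannot be attributed to any one of them; raising the coverage requirement happens to yield a path set in which every link has a distinct signature. This illustrates why coverage and identifiability are treated as separate properties of the probe-path matrix.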
