Detecting the performance impact of upgrades in large operational networks

Networks continue to change to support new applications, improve reliability and performance and reduce the operational cost. The changes are made to the network in the form of upgrades such as software or hardware upgrades, new network or service features and network configuration changes. It is crucial to monitor the network when upgrades are made because they can have a significant impact on network performance and if not monitored may lead to unexpected consequences in operational networks. This can be achieved manually for a small number of devices, but does not scale to large networks with hundreds or thousands of routers and extremely large number of different upgrades made on a regular basis. In this paper, we design and implement a novel infrastructure MERCURY for detecting the impact of network upgrades (or triggers) on performance. MERCURY extracts interesting triggers from a large number of network maintenance activities. It then identifies behavior changes in network performance caused by the triggers. It uses statistical rule mining and network configuration to identify commonality across the behavior changes. We systematically evaluate MERCURY using data collected at a large tier-1 ISP network. By comparing to operational practice, we show that MERCURY is able to capture the interesting triggers and behavior changes induced by the triggers. In some cases, MERCURY also discovers previously unknown network behaviors demonstrating the effectiveness in identifying network conditions flying under the radar.

[1]  Albert G. Greenberg,et al.  Network anomography , 2005, IMC '05.

[2]  Srikanth Kandula,et al.  Shrink: a tool for failure diagnosis in IP networks , 2005, MineNet '05.

[3]  W. A. Taylor Change-Point Analysis : A Powerful New Tool For Detecting Changes , 2000 .

[4]  Carsten Lund,et al.  Modeling and understanding end-to-end class of service policies in operational networks , 2009, SIGCOMM '09.

[5]  Nick Feamster,et al.  Practical issues with using network tomography for fault diagnosis , 2008, CCRV.

[6]  Christophe Diot,et al.  Diagnosing network-wide traffic anomalies , 2004, SIGCOMM.

[7]  AmmarMostafa,et al.  Answering what-if deployment and configuration questions with wise , 2008 .

[8]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[9]  Franck Le,et al.  Minerals: using data mining to detect router misconfigurations , 2006, MineNet '06.

[10]  Geoffrey M. Voelker,et al.  NetPrints: Diagnosing Home Network Misconfigurations Using Shared Knowledge , 2009, NSDI.

[11]  Steven D. Gribble,et al.  Configuration Debugging as Search: Finding the Needle in the Haystack , 2004, OSDI.

[12]  Craig A. Finseth,et al.  An Access Control Protocol, Sometimes Called TACACS , 1993, RFC.

[13]  Franck Le,et al.  Shedding light on the glue logic of the internet routing architecture , 2008, SIGCOMM '08.

[14]  Renata Teixeira,et al.  NetDiagnoser: troubleshooting network unreachabilities using end-to-end probes and routing data , 2007, CoNEXT '07.

[15]  Albert G. Greenberg,et al.  Configuration management at massive scale: system design and experience , 2007, IEEE Journal on Selected Areas in Communications.

[16]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[17]  David A. Maltz,et al.  Routing design in operational networks: a look from the inside , 2004, SIGCOMM.

[18]  Qi Zhao,et al.  Towards automated performance diagnosis in a large IPTV network , 2009, SIGCOMM '09.

[19]  Ramesh Govindan,et al.  ASTUTE: detecting a different class of traffic anomalies , 2010, SIGCOMM '10.

[20]  Ye Wang,et al.  Shadow configuration as a network management primitive , 2008, SIGCOMM '08.

[21]  Ranveer Chandra,et al.  What's going on?: learning communication rules in edge networks , 2008, SIGCOMM '08.

[22]  Fernando Silveira,et al.  URCA: Pulling out Anomalies by their Root Causes , 2010, 2010 Proceedings IEEE INFOCOM.

[23]  Helen J. Wang,et al.  Automatic Misconfiguration Troubleshooting with PeerPressure , 2004, OSDI.

[24]  Yin Zhang,et al.  Troubleshooting chronic conditions in large IP networks , 2008, CoNEXT '08.

[25]  Ítalo S. Cunha,et al.  Measurement methods for fast and accurate blackhole identification with binary tomography , 2009, IMC '09.

[26]  Jennifer Rexford,et al.  Sensitivity of PCA for traffic anomaly detection , 2007, SIGMETRICS '07.

[27]  David A. Maltz,et al.  Unraveling the Complexity of Network Management , 2009, NSDI.

[28]  Paramvir Bahl,et al.  Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.

[29]  Mark Crovella,et al.  Mining anomalies using traffic feature distributions , 2005, SIGCOMM '05.

[30]  Nick Feamster,et al.  Diagnosing network disruptions with network-wide analysis , 2007, SIGMETRICS '07.

[31]  Walter Willinger,et al.  Spatio-temporal compressive sensing and internet traffic matrices , 2009, SIGCOMM '09.

[32]  Helen J. Wang,et al.  Strider: a black-box, state-based approach to change and configuration management and support , 2003, Sci. Comput. Program..

[33]  Divesh Srivastava,et al.  Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data , 2004, SIGMOD '04.

[34]  Stephen R. Garner,et al.  WEKA: The Waikato Environment for Knowledge Analysis , 1996 .

[35]  Xu Chen,et al.  Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions , 2008, OSDI.

[36]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[37]  Nick Feamster,et al.  Detecting BGP configuration faults with static analysis , 2005 .

[38]  Paramvir Bahl,et al.  Towards highly reliable enterprise network services via inference of multi-level dependencies , 2007, SIGCOMM.

[39]  Carsten Lund,et al.  Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications , 2004, IMC '04.

[40]  R. Tsay Outliers, Level Shifts, and Variance Changes in Time Series , 1988 .

[41]  Albert G. Greenberg,et al.  IP fault localization via risk modeling , 2005, NSDI.