Rapid detection of maintenance induced changes in service performance

Service quality in operational IP networks can be impacted due to planned or unplanned maintenance. During any maintenance activity, the responsibility of the operations team is to complete the work order and perform a check-up to ensure there are no unexpected service disruptions. Once the maintenance is complete, it is crucial to continuously monitor the network and look for any performance impacts. What operations lack today are effective tools to rapidly detect maintenance induced performance changes. The large scale and heterogeneity of network elements and performance metrics makes the problem extremely challenging. In this paper, we present PRISM, a new tool for detecting maintenance induced performance changes in a timely fashion. PRISM uses association between maintenance and the network elements to identify performance metrics for time-series analysis. It uses a new <u>M</u>ultiscale <u>R</u>obust <u>L</u>ocal <u>S</u>ubspace algorithm (MRLS) to accurately identify changes in performance even when the baseline is contaminated. We systematically evaluate PRISM using data collected at four large operational networks: a tier-1 backbone, VoIP, IPTV and 3G cellular and show that it achieves good accuracy. We also demonstrate the effectiveness of PRISM in real operational environments through interesting case study findings.

[1]  Fernando Silveira,et al.  URCA: Pulling out Anomalies by their Root Causes , 2010, 2010 Proceedings IEEE INFOCOM.

[2]  Paramvir Bahl,et al.  Towards highly reliable enterprise network services via inference of multi-level dependencies , 2007, SIGCOMM '07.

[3]  Jennifer Rexford,et al.  Sensitivity of PCA for traffic anomaly detection , 2007, SIGMETRICS '07.

[4]  Martin May,et al.  Applying PCA for Traffic Anomaly Detection: Problems and Solutions , 2009, IEEE INFOCOM 2009.

[5]  Yi Ma,et al.  The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices , 2010, Journal of structural biology.

[6]  P. Filzmoser,et al.  Algorithms for Projection-Pursuit Robust Principal Component Analysis , 2007 .

[7]  Paul Barford,et al.  BasisDetect: a model-based network event detection framework , 2010, IMC '10.

[8]  Ramesh Govindan,et al.  ASTUTE: detecting a different class of traffic anomalies , 2010, SIGCOMM '10.

[9]  J. P. Park The Identification Of Multiple Outliers , 2000 .

[10]  Albert G. Greenberg,et al.  Network anomography , 2005, IMC '05.

[11]  Srikanth Kandula,et al.  Shrink: a tool for failure diagnosis in IP networks , 2005, MineNet '05.

[12]  A. Zhigljavsky,et al.  Forecasting European industrial production with singular spectrum analysis , 2009 .

[13]  Franck Le,et al.  Minerals: using data mining to detect router misconfigurations , 2006, MineNet '06.

[14]  V. Moskvina,et al.  An Algorithm Based on Singular Spectrum Analysis for Change-Point Detection , 2003 .

[15]  Paramvir Bahl,et al.  Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.

[16]  Mark Crovella,et al.  Diagnosing network-wide traffic anomalies , 2004, SIGCOMM '04.

[17]  Yi Ma,et al.  Robust principal component analysis? , 2009, JACM.

[18]  Yin Zhang,et al.  Troubleshooting chronic conditions in large IP networks , 2008, CoNEXT '08.

[19]  Albert G. Greenberg,et al.  IP fault localization via risk modeling , 2005, NSDI.

[20]  G. Sapiro,et al.  A collaborative framework for 3D alignment and classification of heterogeneous subvolumes in cryo-electron tomography. , 2013, Journal of structural biology.

[21]  Qi Zhao,et al.  Towards automated performance diagnosis in a large IPTV network , 2009, SIGCOMM '09.

[22]  Kavé Salamatian,et al.  Combining filtering and statistical methods for anomaly detection , 2005, IMC '05.

[23]  Zihui Ge,et al.  Listen to me if you can: tracking user experience of mobile network on social media , 2010, IMC '10.

[24]  Yin Zhang,et al.  Detecting the performance impact of upgrades in large operational networks , 2010, SIGCOMM '10.

[25]  Ling Huang,et al.  ANTIDOTE: understanding and defending against poisoning of anomaly detectors , 2009, IMC '09.

[26]  Tsuyoshi Idé,et al.  Change-Point Detection using Krylov Subspace Learning , 2007, SDM.

[27]  Xu Chen,et al.  Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions , 2008, OSDI.

[28]  AmmarMostafa,et al.  Answering what-if deployment and configuration questions with wise , 2008 .

[29]  Mark Crovella,et al.  Mining anomalies using traffic feature distributions , 2005, SIGCOMM '05.

[30]  Nick Feamster,et al.  Diagnosing network disruptions with network-wide analysis , 2007, SIGMETRICS '07.

[31]  Walter Willinger,et al.  Spatio-temporal compressive sensing and internet traffic matrices , 2009, SIGCOMM '09.

[32]  John Wright,et al.  Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization , 2009, NIPS.

[33]  Christophe Croux,et al.  High breakdown estimators for principal components: the projection-pursuit approach revisited , 2005 .

[34]  Paul Barford,et al.  A signal analysis of network traffic anomalies , 2002, IMW '02.