Network-Wide Anomaly Event Detection and Diagnosis With perfSONAR

High-performance computing (HPC) environments that support data-intensive applications need multidomain network performance measurements from open frameworks such as perfSONAR. When network-wide correlated anomaly events degrade data throughput, operators need to be notified quickly and accurately, along with a root-cause analysis that guides remediation. In this paper, we present a novel network anomaly event detection and diagnosis scheme that provides network-wide visibility and improves the accuracy of root-cause analysis. We address two analysis limitations: cases where complete network topology information is absent, and cases where mis-calibrated measurement probes lead to erroneous diagnosis. Our proposed scheme fuses perfSONAR time-series path measurement data from multiple domains and transforms it using principal component analysis (PCA) so that both correlated and uncorrelated anomaly events can be detected accurately. We quantify the certainty of such detection using a measurement data sanity check that involves: 1) a measurement data reputation analysis to qualify the measurement samples and 2) a filter framework to prune potentially misleading samples. Lastly, using actual perfSONAR one-way delay measurement traces, we show our proposed scheme's effectiveness in diagnosing the root cause of critical network performance anomaly events.
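
To make the PCA step concrete, the following is a minimal sketch of a generic PCA residual (subspace) detector applied to a matrix of per-path delay samples. It is not the scheme proposed in the paper: the function name `pca_residual_anomalies`, the synthetic data, the choice of one principal component, and the median/MAD threshold are all assumptions made for illustration, and the reputation analysis and filter framework are not modeled.

```python
import numpy as np

def pca_residual_anomalies(X, n_components=1, z=6.0):
    """Flag time steps whose PCA residual energy is unusually large.

    X is a (time_steps, paths) matrix of delay samples (hypothetical layout).
    """
    # Center each path's time series (columns are paths).
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                # basis of the "normal" subspace
    residual = Xc - Xc @ P @ P.T           # part not explained by that subspace
    spe = np.sum(residual ** 2, axis=1)    # squared prediction error per time step
    # Illustrative robust threshold (an assumption): median + z * scaled MAD.
    med = np.median(spe)
    mad = np.median(np.abs(spe - med))
    threshold = med + z * 1.4826 * mad
    return np.flatnonzero(spe > threshold), spe, threshold

# Synthetic stand-in for one-way delay traces on four paths (milliseconds).
rng = np.random.default_rng(0)
t = np.arange(200)
common = 10.0 + 2.0 * np.sin(2 * np.pi * t / 50)    # trend shared by all paths
loadings = np.array([1.0, 0.8, 1.2, 0.9])           # per-path scaling of the trend
X = common[:, None] * loadings + rng.normal(0.0, 0.1, size=(200, 4))
X[120:125, 1] += 5.0                                # inject a delay spike on one path

flagged, spe, threshold = pca_residual_anomalies(X)
print("flagged time steps:", flagged)
```

The shared trend is captured by the first principal component, so a spike confined to a single path shows up mainly in the residual. A real deployment would need a principled threshold and a rule for choosing the number of components, both of which are left open here.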
