Robust feature selection and robust PCA for internet traffic anomaly detection

Robust statistics is the branch of statistics concerned with methods that remain reliable in the presence of outliers. In this paper, we propose an anomaly detection method that combines a feature selection algorithm with an outlier detection method and makes extensive use of robust statistics. Feature selection is based on a mutual information metric, for which we have developed a robust estimator; it also includes a novel, automatic procedure for determining the number of relevant features. Outlier detection is based on robust Principal Component Analysis (PCA), which, unlike classical PCA, is not sensitive to outliers and removes the need for training on a reliably labeled dataset, a strong advantage from the operational point of view. To evaluate our method, we designed a network scenario capable of producing a perfect ground truth under real (but controlled) traffic conditions. Results show that our method significantly improves on its classical counterparts. Moreover, although feature selection is largely overlooked in the context of anomaly detection, we find it to be an important preprocessing step that allows adaptation to different network conditions and yields significant performance gains.
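To make the contrast between classical and robust estimators concrete, the following minimal sketch scores a one-dimensional sample twice: once with the mean and standard deviation, and once with the median and the median absolute deviation (MAD). It is an illustration of the general idea only, not the estimator or the PCA procedure proposed in the paper. The traffic counts, the function names, and the cutoff of 3 are assumptions chosen for the example; the factor 1.4826 is the usual constant that makes the MAD comparable to the standard deviation for normally distributed data.

```python
import statistics

# Hypothetical packet counts per interval; the last two values are anomalies.
counts = [100, 102, 98, 101, 99, 103, 97, 100, 5000, 5200]

def classical_scores(xs):
    """Distance from the mean, in units of the (population) standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [abs(x - mean) / std for x in xs]

def robust_scores(xs):
    """Distance from the median, in units of the scaled MAD."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    scale = 1.4826 * mad  # consistency factor for normally distributed data
    return [abs(x - med) / scale for x in xs]

def flag(scores, cutoff=3.0):
    """Indices whose score exceeds the cutoff (3.0 is an assumed, conventional choice)."""
    return [i for i, s in enumerate(scores) if s > cutoff]

print("classical:", flag(classical_scores(counts)))
print("robust:   ", flag(robust_scores(counts)))
```

On this sample the two anomalies inflate the mean and standard deviation so much that neither exceeds the classical cutoff, whereas the median and MAD are barely affected and both anomalies are flagged. This masking effect is the kind of failure that motivates robust alternatives to classical PCA.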