Distributed PCA and Network Anomaly Detection

Abstract

We consider the problem of network anomaly detection when the data are collected and processed across large distributed systems. Our algorithmic framework can be seen as an approximate, distributed version of the well-known Principal Component Analysis (PCA) method: it continuously tracks the behavior of the data projected onto the residual subspace of the principal components, within guaranteed error bounds. Our approach consists of a protocol for local processing at individual monitoring devices, together with global decision-making and monitoring feedback at a coordinator. A key ingredient of our framework is an analytical method, based on stochastic matrix perturbation theory, for balancing the tradeoff between the accuracy of our approximate network anomaly detection and the amount of data communicated over the network.
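To make the residual-subspace idea concrete, the following is a minimal, centralized sketch of PCA-based anomaly detection. It is not the distributed protocol described in the abstract, and it involves no communication between monitors and a coordinator. The function names, the synthetic data, the choice of k = 3 components, and the empirical 99.9% quantile threshold are all assumptions made for illustration.

```python
import numpy as np

def fit_residual_projector(Y, k):
    """Return the column means and the projector onto the residual subspace.

    Y: (n_samples, n_features) matrix of measurements, one column per monitor.
    k: number of principal components treated as the "normal" subspace.
    """
    mean = Y.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    P = Vt[:k].T                           # (n_features, k) top-k directions
    C_res = np.eye(Y.shape[1]) - P @ P.T   # projector onto the residual subspace
    return mean, C_res

def residual_energy(Y, mean, C_res):
    """Squared norm of each centered row after projection onto the residual subspace."""
    R = (Y - mean) @ C_res
    return np.sum(R * R, axis=1)

rng = np.random.default_rng(0)

# Synthetic "normal" data (an assumption): 3 latent factors drive 10 monitors.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
train = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

mean, C_res = fit_residual_projector(train, k=3)

# Illustrative threshold: an empirical quantile of the training residual energy.
threshold = np.quantile(residual_energy(train, mean, C_res), 0.999)

# One observation that follows the normal structure, and a copy with a spike
# injected on a single monitor.
normal_obs = rng.normal(size=(1, 3)) @ mixing
spike = normal_obs.copy()
spike[0, 4] += 5.0

is_anomalous = residual_energy(np.vstack([normal_obs, spike]), mean, C_res) > threshold
print(is_anomalous)
```

In this sketch, an observation is flagged when its energy in the residual subspace exceeds the threshold. The spike changes only one coordinate, but because it does not follow the low-dimensional structure of the training data, most of its energy falls in the residual subspace. A distributed version would have to approximate these quantities from filtered local updates, which is where the accuracy versus communication tradeoff arises.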
