Efficient Anomaly Detection via Matrix Sketching

We consider the problem of finding anomalies in high-dimensional data using popular PCA-based anomaly scores. Naive algorithms for these scores explicitly compute the PCA of the covariance matrix, which requires space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate on real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
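
To make the sketch-then-score idea concrete, here is a minimal illustration in Python with NumPy. It is a sketch under stated assumptions, not the authors' implementation. The abstract does not name the scores, so the example assumes two common PCA-based ones: the rank-k leverage score and the rank-k projection distance. The function names `frequent_directions` and `pca_scores_from_sketch`, the 2*ell-row buffer, and the parameters `ell` and `k` are hypothetical choices made for this example.

```python
import numpy as np


def frequent_directions(rows, d, ell):
    """Stream rows of length d into a (2*ell) x d Frequent Directions sketch B.

    Assumed variant: a buffer of 2*ell rows that is shrunk whenever it fills.
    Memory is O(ell * d), independent of the number of rows seen.
    """
    B = np.zeros((2 * ell, d))
    next_free = 0
    for a in rows:
        if next_free == 2 * ell:
            # Buffer is full: rotate to singular directions and shrink.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            # Subtract the (ell+1)-th largest squared singular value, if any.
            delta = s[ell] ** 2 if len(s) > ell else 0.0
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = np.zeros((2 * ell, d))
            B[: len(s)] = s_shrunk[:, None] * Vt
            # After shrinking, rows ell and beyond are zero and can be reused.
            next_free = ell
        B[next_free] = a
        next_free += 1
    return B


def pca_scores_from_sketch(B, X, k):
    """Approximate two PCA-based scores for each row of X using the sketch B.

    Assumes the top k singular values of B are nonzero.
    Returns (leverage, proj_dist), each of shape (n,).
    """
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    Vk = Vt[:k]                      # top-k right singular vectors, k x d
    proj = X @ Vk.T                  # coordinates in the top-k subspace
    # Assumed rank-k leverage score: sum_i (v_i . x)^2 / sigma_i^2.
    leverage = np.sum((proj / s[:k]) ** 2, axis=1)
    # Assumed rank-k projection distance: squared norm outside the subspace.
    proj_dist = np.sum(X ** 2, axis=1) - np.sum(proj ** 2, axis=1)
    return leverage, proj_dist
```

In this two-pass usage, the first pass builds `B` from the stream and the second pass calls `pca_scores_from_sketch` on the rows to be scored. The d x d covariance matrix is never formed: the only state kept between rows is the 2*ell x d buffer. Because shrinking lowers the sketch's singular values, the leverage scores computed this way are approximations whose quality depends on `ell` and on the spectrum of the data.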
