Outlier Detection in Non-stationary Data Streams

Continuous outlier detection in data streams is an important topic in data mining, with applications in domains such as fraud detection, weather analysis, and intrusion detection. Because real-world data streams are non-stationary, the outlier detection model must be updated in a timely and accurate manner, which is challenging. In this paper, we propose a framework for outlier detection in non-stationary data streams (O-NSD) that detects changes in the underlying data distribution and uses them to trigger a model update. Specifically, we propose an improved distance function between sliding windows that offers a monotonicity property; we develop two accurate change detection algorithms, one of which is parameter-free; and we propose new evaluation measures that quantify the timeliness of the detected changes. Extensive experiments with real-world and synthetic datasets show that our change detection algorithms outperform the state-of-the-art solution. In addition, we demonstrate the O-NSD framework with two popular unsupervised outlier classifiers. Empirical results show that the framework offers higher accuracy and requires a much lower running time than retrain-based and incremental update approaches.
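The general idea of a change-triggered model update can be illustrated with a minimal sketch. It is not the O-NSD distance function or either of the proposed change detection algorithms, none of which are specified in this abstract. As stand-ins, the sketch assumes a one-dimensional stream, a histogram-based total variation distance between a reference window and a sliding window, a fixed drift threshold, and a simple z-score rule as the outlier model. All names and parameter values (`ChangeTriggeredDetector`, `drift_threshold`, `run_demo`, and so on) are hypothetical.

```python
import random
import statistics
from collections import deque


def histogram(values, lo, hi, bins):
    """Normalised equal-width histogram of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
        counts[idx] += 1
    n = len(values)
    return [c / n for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


class ChangeTriggeredDetector:
    """Illustrative only: refit the outlier model when the sliding window drifts."""

    def __init__(self, window=200, bins=10, lo=-5.0, hi=15.0,
                 drift_threshold=0.3, z_threshold=3.0):
        self.window, self.bins, self.lo, self.hi = window, bins, lo, hi
        self.drift_threshold = drift_threshold  # assumed value, not from the paper
        self.z_threshold = z_threshold          # assumed value, not from the paper
        self.current = deque(maxlen=window)     # sliding window over recent points
        self.ref_hist = None                    # histogram of the reference window
        self.mean = self.std = None
        self.updates = 0                        # number of change-triggered refits

    def _fit(self, values):
        # Stand-in outlier model: mean and standard deviation for a z-score rule.
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1e-9
        self.ref_hist = histogram(values, self.lo, self.hi, self.bins)

    def process(self, x):
        """Consume one point; return True if it is flagged as an outlier."""
        self.current.append(x)
        if self.ref_hist is None:
            # Warm-up: fit the initial model on the first full window.
            if len(self.current) == self.window:
                self._fit(list(self.current))
                self.current.clear()
            return False
        is_outlier = abs(x - self.mean) / self.std > self.z_threshold
        if len(self.current) == self.window:
            cur_hist = histogram(self.current, self.lo, self.hi, self.bins)
            if total_variation(self.ref_hist, cur_hist) > self.drift_threshold:
                # Distribution change detected: refit on the recent window only.
                self._fit(list(self.current))
                self.current.clear()
                self.updates += 1
        return is_outlier


def run_demo(seed=0):
    """Synthetic stream whose mean shifts from 0 to 10 halfway through."""
    rng = random.Random(seed)
    stream = [rng.gauss(0.0, 1.0) for _ in range(600)]
    stream += [rng.gauss(10.0, 1.0) for _ in range(600)]
    det = ChangeTriggeredDetector()
    for x in stream:
        det.process(x)
    return det
```

In this sketch the model is refit only when the distance between windows exceeds the threshold, rather than on every window or every point, which is the intuition behind avoiding the cost of retrain-based and incremental update schemes. The fixed threshold is the sketch's main weakness and is exactly the kind of parameter a parameter-free detector would avoid.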
