Scalable KDE-based top-n local outlier detection over large-scale data streams

Abstract: The detection of local outliers over high-volume data streams is critical for diverse real-time applications, where the distributions in different subsets of the data tend to be skewed. However, existing methods do not scale to large, high-volume data streams owing to the high cost of re-running detection whenever the data are updated. In this work, we propose a top-n local outlier detection method based on Kernel Density Estimation (KDE) over large-scale high-volume data streams. First, we define a KDE-based Outlier Factor (KOF) to measure the local outlierness score of each data point. Then, we derive upper bounds of the KOF and an upper-bound-based pruning strategy that quickly eliminates the majority of the inlier points without computing their expensive KOF scores. Moreover, we design an Upper-bound pruning-based top-n KOF detection method (UKOF) over data streams to efficiently handle data updates in a sliding-window environment. Furthermore, we propose a Lazy update method of UKOF (LUKOF) for bulk updates in high-speed large-scale data streams to further reduce the computation cost. Our comprehensive experimental study demonstrates that the proposed method outperforms the state-of-the-art methods by up to 3,600 times in speed, while achieving the best performance in detecting local outliers over data streams.
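To make the setting concrete, the following is a minimal sketch of top-n local outlier detection with a KDE-style score over a sliding window. It is not the KOF, UKOF, or LUKOF defined in the paper: the score used here (mean neighbour density divided by the point's own density), the Gaussian kernel, the brute-force neighbour search, and all names and parameters (`knn`, `local_density`, `top_n_outliers`, `sliding_top_n`, `k`, `h`) are assumptions chosen for illustration. The window is also re-scored from scratch on every slide, which is exactly the naive re-detection cost that upper-bound pruning and lazy updates are meant to avoid.

```python
import heapq
import math
from collections import deque


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (brute force)."""
    dists = [(math.dist(points[i], q), j) for j, q in enumerate(points) if j != i]
    return [j for _, j in heapq.nsmallest(k, dists)]


def local_density(points, i, neighbours, h):
    """Gaussian kernel density at points[i], estimated from its neighbours only."""
    d = len(points[i])
    norm = (2 * math.pi) ** (d / 2) * h ** d
    total = sum(
        math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * h * h))
        for j in neighbours
    )
    return total / (len(neighbours) * norm)


def top_n_outliers(points, n, k=3, h=1.0):
    """Return the indices of the n points with the highest illustrative score.

    Assumed score: mean density of the k neighbours / own density.
    Values well above 1 indicate a point sparser than its neighbourhood.
    """
    nbrs = [knn(points, i, k) for i in range(len(points))]
    dens = [local_density(points, i, nbrs[i], h) for i in range(len(points))]
    eps = 1e-300  # guards against division by zero for very isolated points
    scores = [
        (sum(dens[j] for j in nbrs[i]) / len(nbrs[i])) / (dens[i] + eps)
        for i in range(len(points))
    ]
    return heapq.nlargest(n, range(len(points)), key=scores.__getitem__)


def sliding_top_n(stream, window, n, k=3, h=1.0):
    """Naive count-based sliding window: re-score the full window on each slide."""
    buf = deque(maxlen=window)
    for p in stream:
        buf.append(p)
        if len(buf) == window:
            pts = list(buf)
            yield [pts[i] for i in top_n_outliers(pts, n, k, h)]
```

A call such as `top_n_outliers(points, n=1)` on a tight cluster plus one distant point should return the index of the distant point. Every slide costs a full quadratic neighbour search here, so the sketch is only suitable for small windows.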
