Outlier Detection

Over the past decade, an enormous amount of research effort has been dedicated to the design of outlier detection techniques that account for efficiency, accuracy, high-dimensional data, and distributed environments, among other factors. In this article, we present and examine these characteristics and current solutions, as well as open challenges and future research directions in designing new outlier detection strategies. We propose a taxonomy of recently designed outlier detection strategies, underlining their fundamental characteristics and properties. We also introduce several newly trending outlier detection methods designed for high-dimensional data, data streams, big data, and minimally labeled data. Last, we review their advantages and limitations and discuss the challenges that remain open.
