A survey on unsupervised outlier detection in high‐dimensional numerical data

High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term ‘curse of dimensionality’, more concrete aspects being the so-called ‘distance concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall mainly into two categories, according to whether or not they consider subspaces (subsets of attributes) for the definition of outliers. The former specifically address the presence of irrelevant attributes; the latter consider irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this survey article, we discuss some important aspects of the ‘curse of dimensionality’ in detail and survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
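As a rough illustration of the ‘distance concentration effect’ named above, the following minimal sketch measures the relative contrast, (d_max - d_min) / d_min, of Euclidean distances from a random query point to uniformly distributed random points as the dimensionality grows. The function name `relative_contrast`, the uniform data distribution, the sample size, and the chosen dimensionalities are assumptions made for this illustration only; they are not taken from the survey.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in the unit cube.

    Hypothetical helper for illustration; uniform i.i.d. attributes are
    an assumption, not a property of real data sets.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Compare the contrast between nearest and farthest neighbor
# for a few (arbitrarily chosen) dimensionalities.
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

Under these assumptions the relative contrast is far smaller at 1000 dimensions than at 2, so the nearest and farthest neighbors of the query become hard to tell apart by distance alone. With irrelevant attributes behaving like such independent noise dimensions, this is one way to see why full-space distances can conceal outliers that are evident in a subspace.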
