Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

This article introduces an integrated framework for density-based cluster analysis, outlier detection, and data visualization. Its main module is an algorithm that computes hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. The algorithm generalizes and improves on existing density-based clustering techniques in several respects. It produces a complete clustering hierarchy composed of all possible density-based clusters under the adopted nonparametric model, for an infinite range of density thresholds. The resulting hierarchy can be easily processed to provide multiple ways of visualizing and exploring the data. It can also be postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, in either an unsupervised or a semisupervised way. In the unsupervised scenario, the algorithm behind this postprocessing module provides a globally optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If the user provides partially labeled objects or instance-level constraints, the algorithm can solve the problem by considering both constraint violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, in terms of both running time and memory space, is provided. Experiments on a variety of synthetic and real datasets are reported, including comparisons with state-of-the-art density-based clustering and (global and local) outlier detection methods.
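To make the notion of density-contour clusters at a single density threshold concrete, the following is a minimal sketch and not the article's algorithm. It assumes a k-nearest-neighbor density estimate: a point counts as dense when the distance to its k-th nearest neighbor (itself included) is at most a radius `eps`, and dense points within `eps` of each other are linked. The function names and the parameters `k` and `eps` are hypothetical choices made for illustration only.

```python
from math import dist


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (the point itself counts)."""
    return [sorted(dist(p, q) for q in points)[k - 1] for p in points]


def level_set_clusters(points, k, eps):
    """Label connected components of the dense points at one threshold.

    Assumption for this sketch: a point is dense when its core distance is
    at most eps, and two dense points are linked when they lie within eps
    of each other. Points that are not dense are labeled -1 (noise).
    """
    core = core_distances(points, k)
    dense = [i for i, c in enumerate(core) if c <= eps]
    parent = {i: i for i in dense}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, a in enumerate(dense):
        for b in dense[pos + 1:]:
            if dist(points[a], points[b]) <= eps:
                parent[find(a)] = find(b)

    labels = [-1] * len(points)
    component_ids = {}
    for i in dense:
        labels[i] = component_ids.setdefault(find(i), len(component_ids))
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(level_set_clusters(points, k=3, eps=0.5))   # [-1, -1, -1, -1, -1, -1, -1]
print(level_set_clusters(points, k=3, eps=1.5))   # [0, 0, 0, 1, 1, 1, -1]
print(level_set_clusters(points, k=3, eps=20.0))  # [0, 0, 0, 0, 0, 0, 0]
```

Sweeping `eps` from small to large shows clusters appearing and then merging, which is the nested structure that a density-based hierarchy records for all thresholds at once rather than for one fixed value.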
