DenPEHC: Density peak based efficient hierarchical clustering

Abstract: Existing hierarchical clustering algorithms involve a flat clustering component and an additional agglomerative or divisive procedure. This paper presents a density peak based hierarchical clustering method (DenPEHC), which directly generates clusters on each possible clustering layer, and introduces a grid granulation framework that enables DenPEHC to cluster large-scale and high-dimensional (LSHD) datasets. The study consists of three parts: (1) using the distribution of the parameter γ, defined in “clustering by fast search and find of density peaks” (DPClust) as the product of the local density ρ and the minimal distance δ to data points with higher density, together with a linear fitting approach, to select clustering centers, with the clustering hierarchy decided by finding the “stairs” in the γ curve; (2) analyzing the leading tree (in which each node except the root is led by its parent to join the same cluster) as an intermediate result of DPClust, and constructing the clustering hierarchy efficiently based on the tree; and (3) designing a framework that enables DenPEHC to cluster LSHD datasets when a large number of attributes can be grouped by their semantics. The proposed method builds the clustering hierarchy by simply disconnecting the center points from their parents, with linear computational complexity O(m), where m is the number of clusters. Experiments on synthetic and real datasets show that the proposed method has promising efficiency, accuracy and robustness compared to state-of-the-art methods.
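The leading-tree idea in parts (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Gaussian kernel with a hand-picked cutoff `dc` for the local density ρ, and it takes the number of centers `k` as an input instead of detecting "stairs" in the γ curve by linear fitting. The function names `leading_tree` and `cut_tree` are hypothetical.

```python
import numpy as np

def leading_tree(X, dc):
    """Compute rho, delta, parent links and gamma = rho * delta.

    Assumption: rho uses a Gaussian kernel with cutoff distance dc.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # subtract self term
    order = np.argsort(-rho, kind="stable")         # decreasing density
    n = len(X)
    delta = np.zeros(n)
    parent = np.full(n, -1)                         # -1 marks a root
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = D[i].max()                   # densest point: the root
            continue
        higher = order[:rank]                       # points ranked denser
        j = higher[np.argmin(D[i, higher])]         # nearest denser point
        delta[i] = D[i, j]
        parent[i] = j                               # i is "led" by j
    return rho, delta, parent, rho * delta, order

def cut_tree(parent, gamma, order, k):
    """Disconnect the k largest-gamma points from their parents and label."""
    centers = np.argsort(-gamma, kind="stable")[:k]
    parent = parent.copy()
    parent[centers] = -1                            # one cut per center
    labels = np.full(len(parent), -1)
    labels[centers] = np.arange(k)
    for i in order:                                 # parents come before children
        if labels[i] == -1 and parent[i] != -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

Because each center costs a single parent-link cut, calling `cut_tree` with increasing `k` on the same tree produces successively finer, nested partitions. This nesting is what lets a hierarchy be read directly off the leading tree.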
