Density‐based clustering

Clustering refers to the task of identifying groups or clusters in a data set. In density‐based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects. Density‐based clusters are separated from each other by contiguous regions of low density of objects. Data objects located in low‐density regions are typically considered noise or outliers. In this review article, we discuss the statistical notion of density‐based clusters, classic algorithms for deriving a flat partitioning of density‐based clusters, methods for hierarchical density‐based clustering, and methods for semi‐supervised clustering. We conclude with some open challenges related to density‐based clustering.
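
To make the notion of a density‐based cluster concrete, the following is a minimal sketch of a flat partitioning in the style of DBSCAN, written in plain Python. It is an illustration under stated assumptions, not the method of any particular paper: the function name `dbscan`, the parameters `eps` (neighborhood radius) and `min_pts` (density threshold), the brute-force neighbor search, and the toy data are all chosen here for the example. A point counts as lying in a high-density region when at least `min_pts` points (itself included) fall within distance `eps`; clusters are grown by chaining such points together, and points reachable from no dense point are labeled noise.

```python
from math import dist

NOISE = -1  # label for points in low-density regions


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    # eps-neighborhood of every point (the point itself included).
    # Brute force, O(n^2); an assumption made for brevity.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # Point i is dense ("core"): grow a new cluster from it.
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joined, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # j is dense: keep expanding
        cluster_id += 1
    return labels


# Toy data (hypothetical): two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 0.0),
]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two groups are separated by a region containing no points, so they receive different labels, and the isolated point is reported as noise. Both outcomes depend on the assumed `eps` and `min_pts`; other settings can merge the groups or label everything as noise.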
