Probabilistic density-based estimation of the number of clusters using the DBSCAN-martingale process

Abstract: Density-based clustering is an effective approach that groups together dense patterns in low- and high-dimensional vectors, especially when the number of clusters is unknown. Such vectors are obtained, for example, when computer scientists represent unstructured data and then group them into clusters in an unsupervised way. Another facet of clustering similar artifacts is the detection of densely connected nodes in network structures, where communities of nodes are formed and need to be identified. To that end, we propose a new DBSCAN algorithm for estimating the number of clusters by optimizing a probabilistic process, namely the DBSCAN-Martingale, which involves randomness in the selection of the density parameter. By providing an analytic formula, we minimize the number of iterations that the DBSCAN-Martingale process requires to extract all clusters. Experiments on spatial, textual and visual clustering show that the proposed analytic formula provides a suitable indicator for the optimal number of iterations required to extract all clusters.
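The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the density parameter is drawn uniformly from (0, eps_max), each realization sorts its random values in increasing order, a cluster is counted only if none of its points was extracted at a smaller value, and the final estimate is the most frequent cluster count over several realizations. The function names (`dbscan`, `martingale_realization`, `estimate_num_clusters`) and all parameter defaults are hypothetical, and the analytic formula for the number of iterations is not reproduced here.

```python
import math
import random
from collections import Counter


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns a list of clusters, each a set of point indices."""
    n = len(points)
    # Brute-force eps-neighborhoods (each point is its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    assigned = [False] * n
    clusters = []
    for i in range(n):
        # Start a new cluster only from an unassigned core point.
        if assigned[i] or len(neighbors[i]) < min_pts:
            continue
        cluster, frontier = set(), [i]
        assigned[i] = True
        while frontier:
            p = frontier.pop()
            cluster.add(p)
            if len(neighbors[p]) >= min_pts:  # only core points expand
                for q in neighbors[p]:
                    if not assigned[q]:
                        assigned[q] = True
                        frontier.append(q)
        clusters.append(cluster)
    return clusters


def martingale_realization(points, min_pts, eps_max, steps, rng):
    """One realization: random increasing eps values, counting new clusters."""
    # Assumption: eps is sampled uniformly and processed in increasing order.
    eps_values = sorted(rng.uniform(0.0, eps_max) for _ in range(steps))
    clustered = set()
    k = 0
    for eps in eps_values:
        for cluster in dbscan(points, eps, min_pts):
            # Assumption: keep a cluster only if all of its points are new.
            if clustered.isdisjoint(cluster):
                clustered |= cluster
                k += 1
    return k


def estimate_num_clusters(points, min_pts, eps_max,
                          steps=5, realizations=10, seed=0):
    """Majority vote over the cluster counts of several realizations."""
    rng = random.Random(seed)
    counts = Counter(
        martingale_realization(points, min_pts, eps_max, steps, rng)
        for _ in range(realizations)
    )
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Three well-separated groups of five collinear points each.
    points = [(c + i, 0.0) for c in (0, 100, 200) for i in range(5)]
    print(estimate_num_clusters(points, min_pts=3, eps_max=10.0))
```

Sorting the sampled values means that dense clusters are extracted first and sparser ones later, so a single realization can pick up clusters of different densities. The majority vote then smooths out realizations in which the sampled values happened to miss a density level.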
