Histogram Methods for Unsupervised Clustering

A new algorithm is presented for finding clusters in a dataset of points in R^n without prior knowledge of the cluster structure, including the number of clusters. The algorithm works top-down: it tests the modality of density functions generated from the dataset and splits the set wherever more than one mode is detected. Results on synthetic and text datasets show that the method is comparable to established unsupervised learning algorithms, even though those algorithms require the number of clusters to be specified in advance. The method is particularly suitable for certain distributions and offers a valid alternative in situations where most well-known algorithms do not produce consistent results.
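The top-down idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the density along each coordinate axis is estimated with a plain histogram, and that a cluster counts as multimodal when the smoothed histogram shows a valley at most half as high as the lower of the two peaks around it. The function names (find_valley, split_clusters) and the parameters bins, min_depth, min_peak_frac and min_size are hypothetical choices made for this illustration only.

```python
import numpy as np


def find_valley(values, bins=10, min_depth=0.5, min_peak_frac=0.3):
    """Return a split threshold if the histogram looks bimodal, else None.

    Assumption for this sketch: a valley is a bin whose smoothed count is at
    most (1 - min_depth) times the lower of the highest peaks on either side.
    """
    counts, edges = np.histogram(values, bins=bins)
    # Light 3-tap moving average to damp single-bin noise.
    smooth = np.convolve(counts, np.ones(3) / 3.0, mode="same")
    best = None
    for i in range(1, bins - 1):
        lower_peak = min(smooth[:i].max(), smooth[i + 1:].max())
        # Ignore tiny bumps in the tails of a single mode.
        if lower_peak <= 0 or lower_peak < min_peak_frac * smooth.max():
            continue
        depth = 1.0 - smooth[i] / lower_peak
        if depth >= min_depth and (best is None or depth > best[0]):
            best = (depth, 0.5 * (edges[i] + edges[i + 1]))
    return None if best is None else best[1]


def split_clusters(points, bins=10, min_depth=0.5, min_size=30):
    """Recursively split the point set along any axis that shows a valley."""
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    pending = [0]      # labels of clusters that still need to be tested
    next_label = 1
    while pending:
        label = pending.pop()
        idx = np.flatnonzero(labels == label)
        if len(idx) < min_size:
            continue   # too few points for a meaningful histogram
        for dim in range(points.shape[1]):
            values = points[idx, dim]
            threshold = find_valley(values, bins, min_depth)
            if threshold is None:
                continue
            right = idx[values > threshold]
            if len(right) == 0 or len(right) == len(idx):
                continue
            labels[right] = next_label
            # Both halves are re-tested, which makes the procedure top-down.
            pending.extend([label, next_label])
            next_label += 1
            break
    return labels
```

Calling split_clusters on an array of shape (number of points, n) returns one integer label per point, and the number of clusters never has to be supplied. The valley heuristic only looks at axis-aligned projections, so it stands in for a proper modality test purely for the sake of illustration.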
