Clustering under composite generative models

This paper studies the clustering of data sequences generated from composite distributions using a K-means algorithm based on the Kolmogorov-Smirnov (KS) distance. All data sequences are assumed to be generated from unknown continuous distributions. The maximum intra-cluster KS distance within each distribution cluster is assumed to be smaller than the minimum inter-cluster KS distance between different clusters. A convergence analysis and upper bounds on the error probability are provided for both a known and an unknown number of clusters. Furthermore, it is shown that the error probability decays exponentially as the number of samples in each data sequence goes to infinity, and that the error exponent depends only on the difference between the inter-cluster and intra-cluster KS distances. The analysis is validated by simulation results.
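
As an illustration of the setting, the following is a minimal sketch of a K-means-style procedure driven by the empirical two-sample KS distance, for the case where the number of clusters is known. It is not the paper's algorithm: the function names `ks_distance` and `ks_kmeans`, the farthest-first initialization, and the medoid-style center update (each center is the member sequence with the smallest total KS distance to the rest of its cluster) are all assumptions made for this example.

```python
from bisect import bisect_right
import random


def ks_distance(x, y):
    """Empirical two-sample KS distance: largest gap between the two ECDFs."""
    xs, ys = sorted(x), sorted(y)
    # The supremum of |F_x(t) - F_y(t)| is attained at one of the sample points.
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


def ks_kmeans(seqs, k, max_iter=20):
    """Cluster data sequences into k groups using pairwise KS distances.

    Assumed design (not taken from the paper): farthest-first initialization
    and medoid-style center updates. Returns one cluster label per sequence.
    """
    n = len(seqs)
    d = [[ks_distance(a, b) for b in seqs] for a in seqs]

    # Farthest-first initialization: start with sequence 0, then repeatedly
    # add the sequence whose nearest chosen center is farthest away.
    centers = [0]
    while len(centers) < k:
        centers.append(max(range(n), key=lambda i: min(d[i][c] for c in centers)))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each sequence joins its nearest center.
        new_labels = [min(range(k), key=lambda j: d[i][centers[j]]) for i in range(n)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: the new center is the member with the smallest total
        # KS distance to the other members of its cluster.
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if members:
                centers[j] = min(members, key=lambda c: sum(d[i][c] for i in members))
    return labels


# Toy composite model (assumed for illustration): each cluster contains
# Gaussians with slightly different means, so the intra-cluster KS distance
# is small compared with the inter-cluster KS distance.
rng = random.Random(0)
means = [0.0, 0.2, 0.0, 3.0, 3.2, 3.0]
seqs = [[rng.gauss(m, 1.0) for _ in range(200)] for m in means]
labels = ks_kmeans(seqs, k=2)
print(labels)
```

In this toy setup the first three sequences belong to one composite cluster and the last three to another. Because the gap between the inter-cluster and intra-cluster KS distances is large, the two groups should separate cleanly even at moderate sequence lengths, which is the regime the abstract's separation assumption describes.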
