kpeaks: An R Package for Quick Selection of K for Cluster Analysis

The number of clusters, k, is a mandatory user-specified argument required to start any partitioning clustering algorithm. In unsupervised learning applications, an optimal value of k is generally determined with one of the internal validity indexes. However, determining k with these indexes is computationally expensive, because each index derives k from the results of several runs of a clustering algorithm. In contrast, the package ‘kpeaks’ estimates k before a clustering session starts. It is based on a simple, novel technique that uses descriptive statistics of the peak counts of the features in a dataset. In this paper, we introduce the R package ‘kpeaks’ and illustrate its use for quick selection of the number of clusters with which to start clustering algorithms.
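
The general idea of estimating k from peak counts can be illustrated with a minimal sketch. It is written in Python for self-containment and is not the ‘kpeaks’ API: the function names (`histogram_counts`, `count_peaks`, `estimate_k`), the use of Sturges' rule for the number of classes, and the simple local-maximum definition of a peak are all assumptions made for illustration. Each feature is binned into a histogram, the peaks of the class frequencies are counted, and descriptive statistics of the per-feature peak counts serve as candidate values of k.

```python
import math
import statistics


def histogram_counts(values, bins):
    """Count values in `bins` equal-width classes spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [len(values)]  # constant feature: a single class
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clip so that the maximum value falls in the last class.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def count_peaks(counts):
    """Count local maxima of the class frequencies.

    Assumption: the ends are bordered by zero frequency, and runs of
    equal neighbouring frequencies (plateaus) are collapsed to one point.
    """
    padded = [0] + list(counts) + [0]
    collapsed = [padded[0]]
    for c in padded[1:]:
        if c != collapsed[-1]:
            collapsed.append(c)
    return sum(
        1
        for i in range(1, len(collapsed) - 1)
        if collapsed[i - 1] < collapsed[i] > collapsed[i + 1]
    )


def estimate_k(features):
    """Summarise per-feature peak counts as candidate values of k.

    `features` is a list of columns, one list of numbers per feature.
    The number of classes follows Sturges' rule (an assumption here).
    """
    peak_counts = []
    for column in features:
        bins = math.ceil(math.log2(len(column))) + 1
        peak_counts.append(count_peaks(histogram_counts(column, bins)))
    return {
        "peak_counts": peak_counts,
        "mean": round(statistics.mean(peak_counts)),
        "median": statistics.median(peak_counts),
        "mode": statistics.mode(peak_counts),
        "max": max(peak_counts),
    }


if __name__ == "__main__":
    # Two bimodal features and one unimodal feature (toy data).
    f1 = [1.0, 1.1, 1.2, 0.9, 1.0, 1.05, 0.95, 1.0,
          5.0, 5.1, 4.9, 5.0, 5.05, 4.95, 5.0, 5.0]
    f2 = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5]
    f3 = [0] * 8 + [10] * 8
    print(estimate_k([f1, f2, f3]))
```

No clustering run is needed at any point in this sketch, which is what makes such an estimate cheap compared with computing an internal validity index for several candidate values of k. Which summary statistic to use as k is left to the analyst.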
