A novel validity index with dynamic cut-off for determining true clusters

In a multi-surveillance environment, voluminous data are generated over time. Analysing such data for summarization and inference calls for efficient clustering. Clustering, an unsupervised way of learning about data, aims to partition the data into clusters, and validating the resulting clusters indicates how well they reflect the true structure of the data. In this paper, a novel validation technique with dynamic termination of the clustering process is proposed to obtain true clusters. The validity index is based on both the global and the local proximity relationships of clusters. It is computed for the available clusters from the within-cluster sum of squares, the between-cluster sum of squares, the total sum of squares, the intra-cluster distances and the inter-cluster distances. The ratio between two consecutive validity indices measures the extent of variation and specifies the cut-off point. The cut-off terminates the clustering process dynamically, thereby indicating the number of clusters and validating the clusters obtained. The proposed method is tested on several real and synthetic data sets. Comparisons with existing methods demonstrate the efficiency of the proposed method in detecting true clusters.

A novel validation technique is proposed to obtain true clusters. The method dynamically terminates clustering at the true formation of clusters. Global and local proximity relationships of clusters are considered for validation.
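As a rough illustration of the sum-of-squares quantities named in the abstract, the sketch below computes the within-cluster, between-cluster and total sums of squares for one labelling, and the ratio between consecutive index values. The abstract does not give the formula of the proposed validity index or the rule that turns the ratio into a cut-off, so `stand_in_index` (here simply BSS/TSS) and all function names are assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sum_of_squares(points: Sequence[Point],
                   labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (WSS, BSS, TSS) for one labelling of the points."""
    grand = centroid(points)
    groups: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    wss = 0.0  # within-cluster sum of squares
    bss = 0.0  # between-cluster sum of squares
    for members in groups.values():
        c = centroid(members)
        wss += sum(sq_dist(p, c) for p in members)
        bss += len(members) * sq_dist(c, grand)

    tss = sum(sq_dist(p, grand) for p in points)  # total sum of squares
    return wss, bss, tss


def stand_in_index(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Hypothetical placeholder index (BSS / TSS), NOT the paper's index."""
    _, bss, tss = sum_of_squares(points, labels)
    return bss / tss if tss > 0 else 0.0


def consecutive_ratios(indices: Sequence[float]) -> List[float]:
    """Ratio of each index value to the previous one (values assumed non-zero)."""
    return [indices[i + 1] / indices[i] for i in range(len(indices) - 1)]


# Two well-separated pairs of points, labelled as two clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
wss, bss, tss = sum_of_squares(points, labels)
```

For Euclidean distances, the decomposition TSS = WSS + BSS holds for any labelling, which is a convenient sanity check on an implementation. In a stopping scheme of the kind the abstract describes, the ratios from `consecutive_ratios` would be inspected as clustering proceeds; the threshold or rule applied to them is left open here because the abstract does not state it.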
