New efficient clustering quality indexes

This paper addresses a major challenge in clustering: optimal model selection. It presents new, efficient clustering quality indexes based on feature maximization, an alternative to the usual distributional measures based on entropy or the Chi-square metric, and to vector-based measures such as Euclidean distance or correlation distance. Initial experiments compare the behavior of these new indexes with that of usual cluster quality indexes based on Euclidean distance, on different kinds of test datasets for which ground truth is available. This comparison highlights the superior accuracy and stability of the new method on these datasets, its efficiency from the low-dimensional to the high-dimensional range, and its tolerance to noise. Further experiments are then conducted on "real-life" textual data extracted from a multisource bibliographic database for which ground truth is unknown. These experiments show that the accuracy and stability of the new indexes make it possible to handle diachronic analysis efficiently, whereas other indexes do not meet the requirements of this task.
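To illustrate what index-based model selection means in practice, the following minimal sketch scores candidate partitions with the silhouette width, one of the usual Euclidean-distance-based quality indexes, and keeps the number of clusters with the highest score. It shows the kind of baseline the new indexes are compared against, not the feature-maximization indexes themselves, which are not defined in this abstract. The toy points, the candidate partitions, and the names `mean_silhouette`, `candidates`, and `best_k` are assumptions made for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(points, labels):
    """Mean silhouette width of a partition, using Euclidean distance.

    Assumes at least two clusters and no cluster made only of identical points.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hypothetical candidate partitions, keyed by their number of clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}

# Model selection: keep the number of clusters whose partition scores best.
best_k = max(candidates, key=lambda k: mean_silhouette(points, candidates[k]))
print(best_k)
```

On this toy data the two-cluster partition obtains the higher mean silhouette, so it is the one selected. Any other quality index could be substituted for `mean_silhouette` in the selection step without changing the rest of the procedure.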
