DESPOTA: DEndrogram Slicing through a PermutatiOn Test Approach

Hierarchical clustering is one of the most widespread analytical approaches to classification problems, mainly because of the visual power of its graphical representation, the dendrogram. Nevertheless, choosing an appropriate number of clusters remains the main difficulty for the final user. We introduce DESPOTA (DEndrogram Slicing through a PermutatiOn Test Approach), a novel approach that exploits permutation tests to automatically detect a partition among those embedded in a dendrogram. Unlike the traditional approach, DESPOTA also includes in its search space partitions that do not correspond to horizontal cuts of the dendrogram. Applications to both real and synthetic datasets show the effectiveness of our proposal.
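To make the notion of a partition that is not a horizontal cut concrete, the following minimal sketch builds a small single-linkage dendrogram and slices it in two ways: with a global height threshold, and with a per-node decision taken top-down. Everything here is an assumption made for illustration. The `Node` class, the toy data, and `local_rule` (a simple height-ratio heuristic) are hypothetical and are not the DESPOTA procedure. In particular, `local_rule` merely stands in for the slot where a permutation-test decision at each node would go.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    """A dendrogram node: the indices it contains and its merge height."""
    members: List[int]
    height: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def single_linkage(points: Sequence[float]) -> Node:
    """Naive agglomerative single-linkage clustering of 1-D points."""
    active = [Node([i]) for i in range(len(points))]

    def gap(a: Node, b: Node) -> float:
        # Single linkage: smallest distance between any two members.
        return min(abs(points[i] - points[j])
                   for i in a.members for j in b.members)

    while len(active) > 1:
        a, b = min(combinations(active, 2), key=lambda pair: gap(*pair))
        active = [n for n in active if n is not a and n is not b]
        active.append(Node(sorted(a.members + b.members), gap(a, b), a, b))
    return active[0]


def slice_dendrogram(node: Node,
                     should_split: Callable[[Node], bool]) -> List[List[int]]:
    """Top-down slicing: descend into a node only if should_split says so."""
    if node.left is None or not should_split(node):
        return [node.members]
    return (slice_dendrogram(node.left, should_split)
            + slice_dendrogram(node.right, should_split))


def local_rule(node: Node, ratio: float = 3.0) -> bool:
    """Hypothetical stand-in for a per-node test (NOT a permutation test):
    split when the merge height is large relative to the children's heights."""
    child_scale = max(node.left.height, node.right.height)
    return child_scale > 0 and node.height >= ratio * child_scale


# Toy data: a tight group with two sub-clusters, and one loose group.
points = [0.0, 0.1, 1.0, 1.1, 20.0, 23.0, 26.5]
root = single_linkage(points)

# Horizontal cut: the same height threshold is applied to every branch.
horizontal = sorted(slice_dendrogram(root, lambda n: n.height > 3.2))
# Non-horizontal cut: each node is judged on its own local scale.
non_horizontal = sorted(slice_dendrogram(root, local_rule))

print(horizontal)      # [[0, 1, 2, 3], [4, 5], [6]]
print(non_horizontal)  # [[0, 1], [2, 3], [4, 5, 6]]
```

In this toy example no single height threshold can separate the two tight sub-clusters while keeping the loose group intact, so the second partition is reachable only by deciding branch by branch.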
