A Novel Approach to the Problem of Non-uniqueness of the Solution in Hierarchical Clustering

The existence of multiple solutions in clustering, and in hierarchical clustering in particular, is often ignored in practical applications. The problem is non-trivial, however: different orderings of the same data can produce different cluster sets, which in turn may lead to different interpretations of the same data. The method presented here addresses this issue. It is based on an equivalence relation over dendrograms that makes it possible to generate all and only the significantly different dendrograms for a given dataset, reducing the computational complexity from exponential, as incurred when all possible dendrograms are considered, to polynomial. Experimental results in the neuroimaging and bioinformatics domains show the effectiveness of the proposed method.
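The order dependence described above can be reproduced on a toy example. The sketch below is only an illustration of the problem, not the method proposed here: it assumes a naive agglomerative procedure with complete linkage on 1-D points, and the names `complete_linkage`, `agglomerate`, and `distinct_partitions` are hypothetical helpers introduced for this example. Because the points 0, 1, 2 contain a tie (both adjacent pairs are at distance 1), the pair merged first depends on the input order.

```python
from itertools import combinations, permutations


def complete_linkage(a, b):
    """Complete-linkage distance between two clusters of 1-D points."""
    return max(abs(x - y) for x in a for y in b)


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering (illustrative assumption, not the paper's method).

    Ties between equally close cluster pairs are broken by input order,
    because min() returns the first minimal pair it encounters.
    """
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return frozenset(clusters)


def distinct_partitions(points, n_clusters):
    """Collect the distinct partitions obtained over all input orderings."""
    return {agglomerate(order, n_clusters) for order in permutations(points)}


# Three equally spaced points: d(0, 1) == d(1, 2), so the first merge is tied.
partitions = distinct_partitions((0.0, 1.0, 2.0), 2)
for part in sorted(partitions, key=lambda p: sorted(map(sorted, p))):
    print(sorted(map(sorted, part)))
```

Running the sketch yields two different two-cluster partitions of the same three points, one grouping 0 with 1 and the other grouping 1 with 2. Enumerating every input ordering, as done here, grows factorially with the number of points, which is why a brute-force exploration of all orderings does not scale.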
