Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters

Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning a specific dataset is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well-established hierarchical clustering algorithms: Ward's minimum variance and complete linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward's and complete linkage. We show, on both simulated and real datasets, that CC performs better than the other methods at identifying the correct number of clusters, identifying outliers, and determining the true cluster memberships. We applied CC to samples, to identify disease subtypes, and to gene profiles, to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be used successfully across such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.
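The idea of combining Ward's minimum variance with complete linkage to obtain a partial clustering can be illustrated with a small sketch. This is not the published CC procedure or its R implementation: the fixed cluster counts (2 for Ward, 3 for complete linkage), the greedy matching rule, the toy data, and the function names `agglomerate` and `consensus` are all assumptions made for illustration. Objects on which the two partitions disagree are simply left unassigned.

```python
from itertools import combinations
from math import dist


def agglomerate(points, k, linkage):
    """Naive agglomerative clustering down to k clusters; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    dims = range(len(points[0]))

    def centroid(c):
        return [sum(points[i][d] for i in c) / len(c) for d in dims]

    def cost(a, b):
        if linkage == "complete":
            # Complete linkage: largest pairwise distance between the two clusters.
            return max(dist(points[i], points[j]) for i in a for j in b)
        # Ward: increase in within-cluster sum of squares caused by the merge.
        return len(a) * len(b) / (len(a) + len(b)) * dist(centroid(a), centroid(b)) ** 2

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cost(clusters[p[0]], clusters[p[1]]))
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid

    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def consensus(labels_w, labels_c):
    """Greedily match clusters of the two partitions one-to-one (illustrative rule).

    Points outside the matched contingency-table cells stay None (unassigned).
    """
    cells = {}
    for i, key in enumerate(zip(labels_w, labels_c)):
        cells.setdefault(key, []).append(i)

    result = [None] * len(labels_w)
    used_w, used_c = set(), set()
    next_label = 0
    # Visit the contingency-table cells from largest to smallest.
    for (w, c), members in sorted(cells.items(), key=lambda kv: -len(kv[1])):
        if w in used_w or c in used_c:
            continue
        used_w.add(w)
        used_c.add(c)
        for i in members:
            result[i] = next_label
        next_label += 1
    return result


# Toy data (assumed): two tight groups of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]

ward_labels = agglomerate(points, k=2, linkage="ward")
complete_labels = agglomerate(points, k=3, linkage="complete")
partial = consensus(ward_labels, complete_labels)
print(partial)  # [0, 0, 0, 0, 1, 1, 1, 1, None]
```

In this toy case Ward with two clusters absorbs the distant point into the second group, while complete linkage with three clusters isolates it. The consensus keeps only the memberships both partitions agree on, so the distant point ends up unassigned, which is the kind of result a partial clustering is meant to produce.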
