PFClust: a novel parameter free clustering algorithm

Background
We present PFClust (Parameter Free Clustering), an algorithm that automatically clusters data and identifies a suitable number of clusters without requiring the user to specify any parameters. The algorithm partitions a dataset into clusters that share common attributes, such as the minimum expectation value and the variance of intra-cluster similarity. A set of n objects can be grouped into any number of clusters from one to n, and many clustering methodologies are available for doing so: hierarchical and partitional, agglomerative and divisive. Nonetheless, automatically determining the number of clusters present in a dataset remains a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. We therefore test PFClust on datasets for which an external gold standard of ‘correct’ cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.

Results
We validate PFClust first against a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies, even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three-dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.

Conclusions
We show that PFClust clusters the test datasets a little better, on average, than any of the other algorithms, and does so without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, and the algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the part-manually curated CATH classification of protein domain structures.
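
Agreement between a computed clustering and an external gold standard can be quantified in several ways. The abstract does not state which measure is used, so the following is only an illustrative sketch that assumes the Rand index, a common pair-counting measure: the fraction of object pairs on which two clusterings agree (both place the pair together, or both place it apart). The function name `rand_index` and the toy labels are hypothetical and are not taken from the PFClust implementation.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of object pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two objects
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one object: trivially identical clusterings
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Toy data (assumed for illustration): six objects, a two-cluster gold
# standard, and a three-cluster result that splits one gold cluster.
gold = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
score = rand_index(gold, predicted)
```

Because the measure depends only on which objects are grouped together, it is unaffected by how the clusters are numbered, and the two clusterings being compared need not contain the same number of clusters.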
