Common Clustering Algorithms

This chapter surveys clustering algorithms that are widely used in the data mining community, viewed in light of chemometrics. It starts with a taxonomy of clustering algorithms and discusses two common approaches, partitioning clustering and hierarchical clustering, in detail. Several variants of these methods are presented, and their strengths and weaknesses are addressed. The chapter then reviews hybrid approaches that combine partitioning and hierarchical clustering, and concludes with a brief overview of constrained clustering.
