A variable-selection heuristic for K-means clustering

One of the most vexing problems in cluster analysis is selecting and/or weighting variables so as to retain those that truly define cluster structure while eliminating those that might mask it. This paper presents a variable-selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. The heuristic was subjected to Monte Carlo testing across more than 2200 datasets with known cluster structure. The results indicate that the heuristic is extremely effective at eliminating masking variables. A cluster analysis of real-world financial services data revealed that applying the variable-selection heuristic before running the K-means algorithm resulted in greater cluster stability.
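For readers unfamiliar with the adjusted Rand index, the following is a minimal sketch of how it can be computed for two partitions of the same objects, using the standard Hubert and Arabie chance-corrected form. It is not the heuristic itself, whose steps are not given in the abstract; the function name `adjusted_rand_index` and the handling of degenerate partitions (returning 1.0 when no correction is possible) are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert-Arabie form) between two partitions.

    Illustrative sketch only: the name and the degenerate-case handling
    are assumptions, not taken from the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: nothing to disagree on

    # Contingency-table cell counts and the two sets of marginal counts.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of object pairs placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_a = sum(comb(c, 2) for c in a_counts.values())
    together_b = sum(comb(c, 2) for c in b_counts.values())

    expected = together_a * together_b / total_pairs  # chance agreement
    maximum = (together_a + together_b) / 2           # best attainable

    if maximum == expected:
        return 1.0  # degenerate partitions (e.g. one cluster in both)
    return (together_both - expected) / (maximum - expected)
```

A value of 1 indicates identical partitions up to relabeling, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would give.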
