OCLUS: An Analytic Method for Generating Clusters with Known Overlap

Abstract

The primary method for validating cluster analysis techniques is through Monte Carlo simulations that rely on generating data with known cluster structure (e.g., Milligan 1996). This paper defines two kinds of data generation mechanisms with cluster overlap, marginal and joint; current cluster generation methods are framed within these definitions. An algorithm generating overlapping clusters based on shared densities from several different multivariate distributions is proposed and shown to lead to an easily understandable notion of cluster overlap. Besides outlining the advantages of generating clusters within this framework, a discussion is given of how the proposed data generation technique can be used to augment research into current classification techniques such as finite mixture modeling, classification algorithm robustness, and latent profile analysis.
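To make the idea of overlap as shared density concrete, the following is a minimal sketch and not the OCLUS algorithm itself. It assumes the simplest case: two univariate normal clusters with equal variance, where the area common to both densities is 2Φ(−d / 2σ) for a mean separation d. The function names (`separation_for_overlap`, `shared_area`, `generate_two_clusters`) are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist


def separation_for_overlap(overlap, sigma=1.0):
    """Mean separation d such that N(0, sigma^2) and N(d, sigma^2)
    share a density area equal to `overlap` (assumed 0 < overlap < 1)."""
    if not 0.0 < overlap < 1.0:
        raise ValueError("overlap must be strictly between 0 and 1")
    # Shared area is 2 * Phi(-d / (2 * sigma)); invert it for d.
    return -2.0 * sigma * NormalDist().inv_cdf(overlap / 2.0)


def shared_area(d, sigma=1.0):
    """Area common to the densities of N(0, sigma^2) and N(d, sigma^2)."""
    return 2.0 * NormalDist().cdf(-d / (2.0 * sigma))


def generate_two_clusters(n_per_cluster, overlap, sigma=1.0, seed=0):
    """Draw two equal-sized clusters whose population overlap is `overlap`."""
    rng = random.Random(seed)
    d = separation_for_overlap(overlap, sigma)
    first = [rng.gauss(0.0, sigma) for _ in range(n_per_cluster)]
    second = [rng.gauss(d, sigma) for _ in range(n_per_cluster)]
    return first, second, d


if __name__ == "__main__":
    first, second, d = generate_two_clusters(100, overlap=0.2)
    print(f"separation: {d:.3f}, implied overlap: {shared_area(d):.3f}")
```

Under these assumptions the overlap is a population quantity fixed before sampling, so a simulation can vary it directly as a design factor rather than measuring it after the data are drawn.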

[1]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[2]  W. DeSarbo,et al.  Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm , 1985 .

[3]  Roger K. Blashfield,et al.  Mixture model tests of cluster analysis: Accuracy of four agglomerative hierarchical methods. , 1976 .

[4]  E. L. Lehmann,et al.  Theory of point estimation , 1950 .

[5]  N G Waller,et al.  A Method for Generating Simulated Plasmodes and Artificial Test Clusters with User-Defined Shape, Size, and Orientation. , 1999, Multivariate behavioral research.

[6]  J. Hartigan Testing for Antimodes , 2000 .

[7]  Ali Kara,et al.  HINoV: A New Model to Improve Market Segment Definition by Identifying Noisy Variables , 1999 .

[8]  Varghese S. Jacob,et al.  A study of the classification capabilities of neural networks using unsupervised learning: A comparison with K-means clustering , 1994 .

[9]  H. A. David,et al.  Order Statistics (2nd ed). , 1981 .

[10]  Robert V. Hogg,et al.  Introduction to Mathematical Statistics. , 1966 .

[11]  C. D. Vale,et al.  Simulating multivariate nonnormal distributions , 1983 .

[12]  Paul E. Green,et al.  A Computational Study of Replicated Clustering with an Application to Market Segmentation , 1991 .

[13]  A. D. Gordon A Review of Hierarchical Classification , 1987 .

[14]  Michael R. Anderberg,et al.  Cluster Analysis for Applications , 1973 .

[15]  G H Ball,et al.  A clustering technique for summarizing multivariate data. , 1967, Behavioral science.

[16]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[17]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[18]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[19]  H. Bozdogan,et al.  Multi-sample cluster analysis using Akaike's Information Criterion , 1984 .

[20]  T. Beauchaine,et al.  A comparison of maximum covariance and K-means cluster analysis in classifying cases into known taxon groups. , 2002, Psychological methods.

[21]  P. Arabie,et al.  Indclus: An individual differences generalization of the adclus model and the mapclus algorithm , 1983 .

[22]  W. T. Williams,et al.  A Generalized Sorting Strategy for Computer Classifications , 1966, Nature.

[23]  Charles K. Bayne,et al.  Monte Carlo comparisons of selected clustering procedures , 1980, Pattern Recognit..

[24]  L. Hubert,et al.  A general statistical framework for assessing categorical clustering in free recall. , 1976 .

[25]  W. DeSarbo Gennclus: New models for general nonhierarchical clustering analysis , 1982 .

[26]  Pandu R. Tadikamalla,et al.  On simulating non-normal distributions , 1980 .

[27]  G. W. Milligan,et al.  A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. , 1986, Multivariate behavioral research.

[28]  G. N. Lance,et al.  A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems , 1967, Comput. J..

[29]  Peter J. Rousseeuw,et al.  Robust regression and outlier detection , 1987 .

[30]  G. W. Milligan,et al.  A study of standardization of variables in cluster analysis , 1988 .

[31]  G. W. Milligan,et al.  An algorithm for generating artificial test clusters , 1985 .

[32]  Wayne S. DeSarbo,et al.  Constrained classification: The use of a priori information in cluster analysis , 1984 .

[33]  Douglas Steinley,et al.  Standardizing Variables in K -means Clustering , 2004 .

[34]  M. Brusco,et al.  A variable-selection heuristic for K-means clustering , 2001 .

[35]  Louis L. McQuitty,et al.  Hierarchical Linkage Analysis for the Isolation of Types , 1960 .

[36]  G. W. Milligan,et al.  The validation of four ultrametric clustering algorithms , 1980, Pattern Recognit..

[37]  M. Evans Statistical Distributions , 2000 .

[38]  Robert S. Atlas,et al.  Comparative evaluation of two superior stopping rules for hierarchical cluster analysis , 1994 .

[39]  P. Sneath The application of computers to taxonomy. , 1957, Journal of general microbiology.

[40]  G. W. Milligan,et al.  CLUSTERING VALIDATION: RESULTS AND IMPLICATIONS FOR APPLIED ANALYSES , 1996 .

[41]  Edwin Diday,et al.  Orders and overlapping clusters by pyramids , 1987 .

[42]  Brian Everitt,et al.  Cluster analysis , 1974 .

[43]  E. Mark Gold Flange Detection Cluster Analysis. , 1976 .

[44]  Pierre Hansen,et al.  Partitioning Problems in Cluster Analysis: A Review of Mathematical Programming Approaches , 1994 .

[45]  Douglas Steinley,et al.  Local optima in K-means clustering: what you don't know may hurt you. , 2003, Psychological methods.

[46]  L. Hubert,et al.  Measuring the Power of Hierarchical Cluster Analysis , 1975 .

[47]  E Mark Gold Flange Detection Cluster Analysis. , 1976, Multivariate behavioral research.

[48]  A Gordon,et al.  Classification, 2nd Edition , 1999 .

[49]  G. Soete Optimal variable weighting for ultrametric and additive tree clustering , 1986 .

[50]  Roger N. Shepard,et al.  Additive clustering: Representation of similarities as combinations of discrete overlapping properties. , 1979 .

[51]  R. Blashfield,et al.  A Nearest-Centroid Technique for Evaluating the Minimum-Variance Clustering Procedure. , 1980 .

[52]  Robin Sibson,et al.  The Construction of Hierarchic and Non-Hierarchic Classifications , 1968, Comput. J..

[53]  E. Diday,et al.  AN EXTENSION OF HIERARCHICAL CLUSTERING : THE PYRAMIDAL PRESENTATION , 1986 .

[54]  R. Mojena,et al.  Hierarchical Grouping Methods and Stopping Rules: An Evaluation , 1977, Comput. J..

[55]  H. Bozdogan Choosing the Number of Component Clusters in the Mixture-Model Using a New Informational Complexity Criterion of the Inverse-Fisher Information Matrix , 1993 .

[56]  Werner A. Stahel,et al.  Robust Statistics: The Approach Based on Influence Functions , 1987 .

[57]  Jon R. Kettenring,et al.  Variable selection in clustering and other contexts , 1987 .

[58]  J. Hartigan,et al.  The runt test for multimodality , 1992 .

[59]  Robert R. Sokal,et al.  A statistical method for evaluating systematic relationships , 1958 .

[60]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[61]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[62]  Stanley L. Sclove,et al.  Correction to “Multi-sample cluster analysis using Akaike's information criterion” , 1984 .

[63]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[64]  L. Fisher,et al.  391: A Monte Carlo Comparison of Six Clustering Procedures , 1975 .

[65]  R. M. Cormack,et al.  A Review of Classification , 1971 .

[66]  P. Bickel,et al.  Mathematical Statistics: Basic Ideas and Selected Topics , 1977 .

[67]  Niels G. Waller,et al.  A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms , 1998 .

[68]  G. W. Milligan,et al.  The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  Lydia J. Price Identifying cluster overlap with NORMIX population membership probabilities , 1993 .

[70]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[71]  Allen I. Fleishman A method for simulating non-normal distributions , 1978 .