Measuring the influence of individual data points in a cluster analysis

The problem of measuring the impact of individual data points in a cluster analysis is examined. The purpose is to identify those data points that influence the resulting cluster partitions. A data point is considered influential when removing it from the data set produces a different cluster partition. The Hubert and Arabie (1985) corrected Rand index was used to provide numerical measures of the influence of a data point. Simulated data sets covering a variety of cluster structures and error conditions were generated to validate the influence measures. The results showed that the measure of internal influence was 100% accurate in identifying those data elements that exhibited an influential effect. The nature of the influence, whether beneficial or detrimental to the clustering, can be evaluated with the gamma and point-biserial statistics.
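The leave-one-out idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the clustering method (a naive single-linkage routine), the function names `adjusted_rand`, `single_linkage`, and `point_influence`, and the toy data are all assumptions made for the example. For each point, the partition of the full data (restricted to the remaining points) is compared with the partition obtained after deleting that point, using the Hubert and Arabie corrected Rand index; a score below 1 signals that the deletion changed the partition.

```python
from itertools import combinations
from math import comb, dist


def adjusted_rand(a, b):
    """Hubert-Arabie corrected Rand index for two labelings of the same items."""
    cells, rows, cols = {}, {}, {}
    for x, y in zip(a, b):
        cells[(x, y)] = cells.get((x, y), 0) + 1
        rows[x] = rows.get(x, 0) + 1
        cols[y] = cols.get(y, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(len(a), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # both partitions trivial (one cluster or all singletons)
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def single_linkage(points, k):
    """Naive single-linkage clustering cut at k clusters (illustrative choice)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Merge the two clusters whose closest members are nearest to each other.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: min(
                dist(points[i], points[j])
                for i in clusters[pq[0]]
                for j in clusters[pq[1]]
            ),
        )
        clusters[p] += clusters.pop(q)  # p < q, so index p is still valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def point_influence(points, k, cluster=single_linkage):
    """Corrected Rand index between the full and leave-one-out partitions, per point."""
    full = cluster(points, k)
    scores = []
    for i in range(len(points)):
        reduced = cluster(points[:i] + points[i + 1:], k)
        reference = full[:i] + full[i + 1:]  # full-data labels without point i
        scores.append(adjusted_rand(reference, reduced))
    return scores


# Toy data (hypothetical): two tight groups plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 20)]
scores = point_influence(points, k=2)
print(scores)
```

On this toy data only the distant last point receives a score below 1: with it present, the two-cluster solution isolates it and lumps the two groups together, whereas deleting it lets the two groups separate. A score alone does not say whether the influence helps or harms the clustering, which is where the abstract's gamma and point-biserial statistics come in.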
