Hierarchical clustering of variables: a comparison among strategies of analysis

In this paper, some hierarchical methods for identifying groups of variables are illustrated and compared. It is shown that multivariate association measures between two sets of variables can overcome the drawbacks of the commonly used bivariate correlation coefficient, but that the resulting methods are generally not monotonic. A new multivariate association measure is therefore proposed, based on the links between canonical correlation analysis and principal component analysis, which is better suited to this purpose. The hierarchical method based on the proposed measure is illustrated and compared with other possible solutions by analysing simulated and real data sets. Finally, an extension of the proposed method to the more general case of mixed (qualitative and quantitative) variables is proposed and discussed theoretically.
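As a point of reference, the baseline that the abstract contrasts against (clustering variables with a bivariate correlation coefficient) might be sketched as follows. This is a minimal illustration and not the measure proposed in the paper: the dissimilarity `1 - |r|`, the choice of average linkage, the simulated data, and the names `pearson`, `cluster_variables` and `demo` are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_variables(columns):
    """Agglomerative average-linkage clustering of variables.

    columns: dict mapping variable name -> list of observations.
    Returns the merge history as (group_a, group_b, dissimilarity) tuples.
    """
    names = list(columns)
    # Assumed bivariate dissimilarity between single variables: 1 - |r|.
    d = {
        frozenset((a, b)): 1.0 - abs(pearson(columns[a], columns[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    groups = [(name,) for name in names]
    merges = []
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Average linkage: mean dissimilarity over all cross pairs.
                pairs = [d[frozenset((a, b))]
                         for a in groups[i] for b in groups[j]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        merges.append((groups[i], groups[j], avg))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return merges


def demo():
    """Simulate two latent factors, each observed through two noisy variables."""
    rng = random.Random(0)
    n = 200
    f1 = [rng.gauss(0, 1) for _ in range(n)]
    f2 = [rng.gauss(0, 1) for _ in range(n)]

    def noisy(factor):
        return [v + 0.3 * rng.gauss(0, 1) for v in factor]

    data = {"x1": noisy(f1), "x2": noisy(f1),
            "y1": noisy(f2), "y2": noisy(f2)}
    return cluster_variables(data)


if __name__ == "__main__":
    for left, right, height in demo():
        print(left, right, round(height, 3))
```

In this sketch each group is compared only through pairwise correlations of its members, which is the limitation that set-to-set association measures are meant to address. Average linkage is used here because its fusion levels never decrease from one merge to the next.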
