Adaptive Hierarchical Clustering Schemes

Rohlf, F. J. (Biological Sciences, State Univ., Stony Brook, N. Y. 11790) 1970. Adaptive hierarchical clustering schemes. Syst. Zool., 18:58-82.

Various methods of summarizing phenetic relationships are briefly reviewed (including a comparison of principal components analysis and non-metric scaling). Sequential agglomerative hierarchical clustering schemes are considered in particular detail, and several new methods are proposed. The new algorithms are characterized by their ability to "adapt" to the possible trends of variation found within clusters as they are being formed. A nonlinear version allows the isolation and description of clusters which are parabolic, ring-shaped, etc., by the introduction of appropriate dummy variables. Procedures for computing the best fitting trend line through the cluster are also presented, and problems in measuring the amount of information lost by clustering are discussed. [Phenetics; cluster analysis; numerical taxonomy.]

This paper is concerned with a brief review of some of the techniques of summarizing phenetic similarities that have been proposed for use in numerical taxonomy. One class of methods (sequential agglomerative) is considered in detail and new procedures which allow for elongated and curvilinear clusters are proposed.

The "taxonomy problem" in biology can be described as follows: Given a set of specimens ("operational taxonomic units" or OTU's, Sokal and Sneath, 1963, which may represent taxa of any rank) known only by a list of their properties or characters, we wish to find the "best" way of describing their often complex patterns of mutual similarities (phenetic relationships). Such relationships do not necessarily imply evolutionary (cladistic) relationships (for a discussion of these approaches, see Sokal and Camin, 1965).
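As a rough illustration of what a sequential agglomerative scheme does with a set of OTU's described by their characters, the following minimal Python sketch repeatedly merges the two closest clusters. It assumes Euclidean distance and the single-linkage (nearest-neighbor) criterion, which are conventional choices made here for illustration only; it is not the adaptive procedure proposed in the paper. The function name `single_linkage` and the five two-character OTU's are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Assumption: single linkage on Euclidean distance. Clusters are
    tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two clusters by their union and record the step.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical OTU's, each described by two characters.
otus = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.5), (10.0, 0.0)]
merges = single_linkage(otus)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The merge history, read from first to last, gives the nested system of clusters that a dendrogram would display. A scheme that adapts to elongated clusters would change only the rule used to measure the distance between a candidate OTU and an existing cluster.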
The methods that have been developed appear to have a more general application than just in biological taxonomy, but there are certain facts and assumptions that can be made in biology which influence our choice of methods. As a result, the techniques may or may not be completely valid in other fields. Some of the considerations which influence the development of cluster analyses in biological taxonomy are the following:

(1) "All things being equal" we would hope that a system of nested clusters would be found. This is due to the fact that evolution is believed usually to be a divergent process and the distribution of OTU's in a phenetic space should to some extent reflect this. There are, of course, exceptions to this overall rule which are very important, such as those provided by hybridization and clinal variation.

(2) Another consideration is the nature of the character set representing each OTU. We would like to use a "random sampling of characters" or at least a "representative" sampling of characters. But since different sets of characters seem to yield slightly different systems of relationships (Rohlf, 1963; Ehrlich and Ehrlich, 1967; Michener and Sokal, 1966), biologists may have to get used to the idea of using different classifications, based upon different sets of characters, each best for its own special purpose, with overall similarities based on the total character set available at any one time.

(3) The selection of OTU's is also not random. Since we cannot study all organisms, we must select those which are of immediate interest. But even with a specified group of organisms, we usually cannot sample at random. This is so because the distributions of recent (and even fossil) organisms are clumped in a phenetic hyperspace. One needs to pass up many very similar, common specimens to obtain a more interesting sampling of different kinds of organisms.
Thus, a preliminary screening of individuals according to their apparent similarities must be made before one can make detailed measurements to analyze their phenetic relationships quan-

[1]  Karl Pearson,et al.  ON THE COEFFICIENT OF RACIAL LIKENESS , 1926 .

[2]  Calyampudi R. Rao,et al.  Advanced Statistical Methods in Biometric Research. , 1953 .

[3]  J. Kruskal On the shortest spanning subtree of a graph and the traveling salesman problem , 1956 .

[4]  R. Prim Shortest connection networks and some generalizations , 1957 .

[5]  R. Sokal,et al.  A QUANTITATIVE APPROACH TO A PROBLEM IN CLASSIFICATION , 1957, Evolution; International Journal of Organic Evolution.

[6]  W. T. Williams,et al.  Multivariate Methods in Plant Ecology: I. Association-Analysis in Plant Communities , 1959 .

[8]  T. W. Anderson An Introduction to Multivariate Statistical Analysis , 1959 .

[9]  D J Rogers,et al.  A Computer Program for Classifying Plants. , 1960, Science.

[10]  Robert R. Sokal,et al.  Distance as a Measure of Taxonomic Similarity , 1961 .

[11]  George S. Sebestyen,et al.  Decision-making processes in pattern recognition , 1962 .

[12]  O. Ore Theory of Graphs , 1962 .

[13]  R. P. McDonald,et al.  A general approach to nonlinear factor analysis , 1962 .

[14]  R. Sokal,et al.  THE COMPARISON OF DENDROGRAMS BY OBJECTIVE METHODS , 1962 .

[15]  F. B. Hildebrand Advanced Calculus for Applications , 1962 .

[16]  F. Rohlf Congruence of Larval and Adult Classifications in Aedes (Diptera: Culicidae) , 1963 .

[17]  Raymond E. Bonner,et al.  On Some Clustering Techniques , 1964, IBM J. Res. Dev..

[18]  E. J. Dupraw Non-Linnean Taxonomy , 1964, Nature.

[19]  Methods for Checking the Results of a Numerical Taxonomic Study , 1964 .

[20]  Geoffrey H. Ball,et al.  Data analysis in the social sciences: what about the details? , 1965, AFIPS '65 (Fall, part I).

[21]  Robert R. Sokal,et al.  The Two Taxonomies: Areas of Agreement and Conflict , 1965 .

[22]  J. G. Skellam,et al.  Multivariate Statistical Analysis for Biologists , 1965 .

[23]  R. Sokal,et al.  A METHOD FOR DEDUCING BRANCHING SEQUENCES IN PHYLOGENY , 1965 .

[24]  Some Processes of Numerical Taxonomy in Terms of Distance , 1966 .

[25]  G. Estabrook A mathematical model in graph theory for biological classification. , 1966, Journal of theoretical biology.

[26]  J. Farris ESTIMATION OF CONSERVATISM OF CHARACTERS BY CONSTANCY WITHIN BIOLOGICAL POPULATIONS , 1966, Evolution; international journal of organic evolution.

[27]  R. Sokal,et al.  Two Tests of the Hypothesis of Nonspecificity in the Hoplitis Complex (Hymenoptera: Megachilidae) , 1966 .

[28]  Jerrold Rubin,et al.  An Approach to Organizing Data into Homogeneous Groups , 1966 .

[29]  T. P. Burnaby Growth-Invariant Discriminant Functions and Generalized Distances , 1966 .

[30]  Peter H. A. Sneath,et al.  A Method for Curve Seeking from Scattered Points , 1966, Comput. J..

[31]  Rank order cluster analysis. , 1966, The British journal of mathematical and statistical psychology.

[32]  P. H. A. Sneath,et al.  Some Statistical Problems in Numerical Taxonomy , 1967 .

[33]  A METHOD OF CLUSTER ANALYSIS , 1967 .

[34]  G. N. Lance,et al.  A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems , 1967, Comput. J..

[35]  J. Hartigan REPRESENTATION OF SIMILARITY MATRICES BY TREES , 1967 .

[36]  H. P. Friedman,et al.  On Some Invariant Criteria for Grouping Data , 1967 .

[37]  S. C. Johnson Hierarchical clustering schemes , 1967, Psychometrika.

[38]  L. Cavalli-Sforza,et al.  PHYLOGENETIC ANALYSIS: MODELS AND ESTIMATION PROCEDURES , 1967, Evolution; international journal of organic evolution.

[39]  J. Gower A comparison of some methods of cluster analysis. , 1967, Biometrics.

[40]  C. J. Jardine,et al.  The structure and construction of taxonomic hierarchies , 1967 .

[41]  W. W. Moss Some New Analytic and Graphic Approaches to Numerical Taxonomy, with an Example from the Dermanyssidae (ACARI) , 1967 .

[42]  Graeme Bonham-Carter,et al.  Fortran IV program for Q-mode cluster analysis of nonqualitative data using IBM 7090/7094 computers , 1967 .

[43]  Roderick P. McDonald NUMERICAL METHODS FOR POLYNOMIAL MODELS IN NONLINEAR FACTOR ANALYSIS , 1967 .

[44]  P. Ehrlich,et al.  The Phenetic Relationships of the Butterflies I. Adult Taxonomy and the Nonspecificity Hypothesis , 1967 .

[45]  P. Ehrlich,et al.  Evolutionary History and Population Biology , 1967, Nature.

[46]  J. Carmichael,et al.  FINDING NATURAL CLUSTERS , 1968 .

[47]  F. Rohlf Stereograms In Numerical Taxonomy , 1968 .

[48]  R. Sibson,et al.  A model for taxonomy , 1968 .

[49]  Robin Sibson,et al.  The Construction of Hierarchic and Non-Hierarchic Classifications , 1968, Comput. J..

[50]  F. Rohlf,et al.  Tests for Hierarchical Structure in Random Data Sets , 1968 .

[51]  J. Farris On the Cophenetic Correlation Coefficient , 1969 .

[52]  The Second Annual Conference on Numerical Taxonomy , 1969 .
