Paper Information - Adaptive Hierarchical Clustering Schemes

Adaptive Hierarchical Clustering Schemes

Rohlf, F. J. (Biological Sciences, State Univ., Stony Brook, N. Y. 11790) 1970. Adaptive hierarchical clustering schemes. Syst. Zool., 18:58-82.

Various methods of summarizing phenetic relationships are briefly reviewed (including a comparison of principal components analysis and non-metric scaling). Sequential agglomerative hierarchical clustering schemes are considered in particular detail, and several new methods are proposed. The new algorithms are characterized by their ability to "adapt" to the possible trends of variation found within clusters as they are being formed. A nonlinear version allows the isolation and description of clusters which are parabolic, ring-shaped, etc., by the introduction of appropriate dummy variables. Procedures for computing the best fitting trend line through the cluster are also presented, and problems in measuring the amount of information lost by clustering are discussed. [Phenetics; cluster analysis; numerical taxonomy.]

This paper is concerned with a brief review of some of the techniques of summarizing phenetic similarities that have been proposed for use in numerical taxonomy. One class of methods (sequential agglomerative) is considered in detail and new procedures which allow for elongated and curvilinear clusters are proposed.

The "taxonomy problem" in biology can be described as follows: Given a set of specimens ("operational taxonomic units" or OTU's, Sokal and Sneath, 1963, which may represent taxa of any rank) known only by a list of their properties or characters, we wish to find the "best" way of describing their often complex patterns of mutual similarities (phenetic relationships). Such relationships do not necessarily imply evolutionary (cladistic) relationships (for a discussion of these approaches, see Sokal and Camin, 1965).
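
As a rough illustration of what a sequential agglomerative scheme does, the sketch below repeatedly merges the two closest clusters until only one remains. It is not the adaptive method proposed in the paper, whose details do not appear in this excerpt. It is plain single-linkage clustering, and the function name `single_linkage`, the sample coordinates in `otus`, and the use of Euclidean distance are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(points):
    """Generic single-linkage agglomerative clustering (illustrative only).

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are tuples of indices into `points`.
    """
    # Start with every OTU in its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two clusters with their union and record the merge.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical OTUs described by two characters each.
otus = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
history = single_linkage(otus)
for members_a, members_b, distance in history:
    print(members_a, members_b, round(distance, 3))
```

The merge history is the nested hierarchy: each entry joins two existing clusters at a given distance, which is the information a dendrogram displays. The search over all cluster pairs keeps the sketch short but is only practical for small data sets.
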
The methods that have been developed appear to have a more general application than just in biological taxonomy, but there are certain facts and assumptions that can be made in biology which influence our choice of methods. As a result, the techniques may or may not be completely valid in other fields. Some of the considerations which influence the development of cluster analyses in biological taxonomy are the following: (1) "All things being equal" we would hope that a system of nested clusters would be found. This is due to the fact that evolution is believed usually to be a divergent process and the distribution of OTU's in a phenetic space should to some extent reflect this. There are, of course, exceptions to this overall rule which are very important, such as those provided by hybridization and clinal variation. (2) Another consideration is the nature of the character set representing each OTU. We would like to use a "random sampling of characters" or at least a "representative" sampling of characters. But since different sets of characters seem to yield slightly different systems of relationships (Rohlf, 1963; Ehrlich and Ehrlich, 1967; Michener and Sokal, 1966), biologists may have to get used to the idea of using different classifications, based upon different sets of characters, each best for its own special purpose, with overall similarities based on the total character set available at any one time. (3) The selection of OTU's is also not random. Since we cannot study all organisms, we must select those which are of immediate interest. But even with a specified group of organisms, we usually cannot sample at random. This is so because the distributions of recent (and even fossil) organisms are clumped in a phenetic hyperspace. One needs to pass up many very similar, common specimens to obtain a more interesting sampling of different kinds of organisms. 
Thus, a preliminary screening of individuals according to their apparent similarities must be made before one can make detailed measurements to analyze their phenetic relationships quan-

F. Rohlf

[1] Karl Pearson et al. On the Coefficient of Racial Likeness, 1926.

[2] Calyampudi R. Rao et al. Advanced Statistical Methods in Biometric Research, 1953.

[3] J. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem, 1956.

[4] R. Prim. Shortest connection networks and some generalizations, 1957.

[5] R. Sokal et al. A Quantitative Approach to a Problem in Classification, 1957, Evolution; International Journal of Organic Evolution.

[6] W. T. Williams et al. Multivariate Methods in Plant Ecology: I. Association-Analysis in Plant Communities, 1959.

[7] T. W. Anderson et al. An Introduction to Multivariate Statistical Analysis, 1959.

[8] T. W. Anderson. An Introduction to Multivariate Statistical Analysis, 1959.

[9] D. J. Rogers et al. A Computer Program for Classifying Plants, 1960, Science.

[10] Robert R. Sokal et al. Distance as a Measure of Taxonomic Similarity, 1961.

[11] George S. Sebestyen et al. Decision-making processes in pattern recognition, 1962.