General formulation and evaluation of agglomerative clustering methods with metric and non-metric distances

Abstract: Agglomerative clustering methods with stopping criteria are generalized. Clustering-related concepts are rigorously formulated, with special attention to the metricity of the object space. A new definition of combinatoriality is given, and a stronger proposition of monotonicity is proven. Specializations of the general method are applied to non-attributive non-metric and attributive pseudometric representations of biosequences. The furthest neighbor method is shown to be suitable for non-metric use. In metric object space, four inter-cluster distance functions, including a new, truly context-sensitive method, are compared using a method-independent goodness criterion. For biosequence clustering, the new method outperforms the UPGMA, UPGMC, and furthest neighbor methods.
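To make the setting concrete, the following is a minimal sketch of agglomerative clustering with a stopping criterion, using the furthest neighbor (complete linkage) inter-cluster distance on a precomputed dissimilarity matrix. It is an illustration only, not the paper's general formulation: the function names, the example matrix, and the choice of a fixed distance threshold as the stopping criterion are all assumptions made here. The example matrix deliberately violates the triangle inequality, since furthest neighbor needs only pairwise dissimilarities and no metric structure.

```python
from itertools import combinations


def furthest_neighbor(d, a, b):
    """Largest pairwise dissimilarity between members of clusters a and b."""
    return max(d[i][j] for i in a for j in b)


def agglomerate(d, threshold):
    """Merge the closest pair of clusters until the closest pair is farther
    apart than `threshold` (a hypothetical stopping criterion)."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest furthest-neighbor distance.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: furthest_neighbor(d, clusters[pq[0]], clusters[pq[1]]),
        )
        if furthest_neighbor(d, clusters[p], clusters[q]) > threshold:
            break  # stopping criterion reached
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Symmetric but non-metric: d[0][2] = 5 exceeds d[0][1] + d[1][2] = 3.
d = [
    [0, 1, 5, 9],
    [1, 0, 2, 9],
    [5, 2, 0, 9],
    [9, 9, 9, 0],
]

print(agglomerate(d, threshold=5))  # [[0, 1, 2], [3]]
print(agglomerate(d, threshold=4))  # [[0, 1], [2], [3]]
```

This naive version recomputes every inter-cluster distance in each round, which keeps the sketch short but is far slower than update-formula or heap-based implementations.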
