Representative noise-free complete-link classification with application to protein structures

In many applications, including problems of knowledge discovery in databases and particularly in computational molecular biology, a compact and representative description of a vast object space is desired. This paper develops a constructive mathematical model corresponding to the intuitive requirements of representativity. Representativity is divided into two aspects: typicality and comprehensiveness. A new sieving method is presented in which a special kind of noise is detected and eliminated by removing anomalous objects from the initial complete-linkage partition. The comprehensiveness endangered by sieving is then regained by applying a special completion procedure. Theoretical results ensure that the resulting partition is representative, consisting of solid and separable classes. The conceptual model was further tested by applying the method to protein amino acid sequences from the Brookhaven Protein Data Bank. The recognized biochemical substance of the outcome confirms the representativity of the resulting classification.
