Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics.

Hierarchical classification is probably the most popular approach to grouping related proteins. However, a number of problems are associated with its use for this purpose. One is that the resulting tree, which shows a nested sequence of groups, may not be the most suitable representation of the data. Another is that the most appropriate number of subsets is usually decided by visual inspection of the tree. In fact, classification of proteins in general is bedevilled by the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to clustering related proteins that needs no a priori threshold and that, through its use of dynamic programming, is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences, while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme is put forward that combines the information in all the positions of a multiple alignment in a novel way. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
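The idea of a threshold-free, globally optimal partition at every level of granularity can be illustrated with a small sketch. The abstract does not state the objective function or the weighting scheme, so the code below rests on assumptions: each sequence is taken to carry a single numeric weight, groups are contiguous runs of the sorted weights, and the quantity minimised is the within-group sum of squared deviations. The function name `optimal_partitions` and the example weights are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def optimal_partitions(
    weights: List[float], max_groups: int
) -> Dict[int, Tuple[float, List[List[float]]]]:
    """For each k = 1..max_groups, split the sorted weights into k contiguous
    groups that minimise the total within-group sum of squared deviations.

    Assumed objective (not specified in the abstract); solved exactly by
    dynamic programming over prefixes of the sorted weights.
    """
    x = sorted(weights)
    n = len(x)
    max_groups = min(max_groups, n)  # cannot have more groups than items

    # Prefix sums of values and squared values give O(1) segment costs.
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i: int, j: int) -> float:
        """Sum of squared deviations from the mean for x[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting the first j items into k groups.
    best = [[inf] * (n + 1) for _ in range(max_groups + 1)]
    # cut[k][j]: start index of the last group in that optimal split.
    cut = [[0] * (n + 1) for _ in range(max_groups + 1)]
    best[0][0] = 0.0
    for k in range(1, max_groups + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    cut[k][j] = i

    # Trace back the group boundaries for every level of granularity.
    result: Dict[int, Tuple[float, List[List[float]]]] = {}
    for k in range(1, max_groups + 1):
        groups: List[List[float]] = []
        j = n
        for g in range(k, 0, -1):
            i = cut[g][j]
            groups.append(x[i:j])
            j = i
        result[k] = (best[k][n], groups[::-1])
    return result


if __name__ == "__main__":
    # Hypothetical sequence weights for a six-member family.
    example = [0.10, 0.12, 0.11, 0.50, 0.52, 0.90]
    for k, (total_cost, groups) in optimal_partitions(example, 3).items():
        print(k, groups)
```

Because a single pass fills the table for every k up to `max_groups`, the optimal partition at each level of granularity is available without choosing a cut-off in advance. In the paper's terms, a group of low-variability, typical weights would play the role of the 'core' and outlying groups the 'periphery', although how the paper itself maps weights to core and periphery is not given in the abstract.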
