CASTOR: Clustering Algorithm for Sequence Taxonomical Organization and Relationships

Given a set of related proteins, two important problems in biology are the inference of protein subsets such that members of one subset share a common function and the identification of protein regions that possess functional significance. The former is typically approached by hierarchical bottom-up clustering based on pairwise sequence similarity and various linkage rules. The latter is typically approached in a supervised manner, based on global multiple sequence alignment. However, the two problems are inextricably linked, since functional subsets are usually characterized by distinctive functional regions. This paper introduces CASTOR, an automatic and unsupervised system that addresses both problems simultaneously and efficiently. It identifies protein regions that are likely to have functional significance by discovering and refining statistically significant motifs. It infers likely functional protein subsets and their relationships based on the presence of the discovered motifs in a top-down and recursive manner, allowing the identification of both hierarchical and nonhierarchical subset relationships. This is, to our knowledge, the first system that approaches both problems simultaneously in a top-down, systematic manner. CASTOR's performance is evaluated against the G-protein coupled receptor superfamily. The identified protein regions lead to a taxonomical organization of this superfamily that is in remarkable agreement with a biologically motivated one and which outperforms those produced by bottom-up clustering methods. We also find that conventional hierarchical representations may fail to accurately describe the complexity of evolutionary development responsible for the final organization of a complex protein family. In particular, many functional relationships governing distant subfamilies of such a protein family may not be represented hierarchically.

[1]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[2]  A Bairoch,et al.  The SWISS-PROT protein sequence database: its relevance to human molecular medical research. , 1997, Journal of molecular medicine.

[3]  D. Lipman,et al.  A genomic perspective on protein families. , 1997, Science.

[4]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[5]  S. Pietrokovski Searching databases of conserved sequence regions by aligning protein multiple-alignments. , 1996, Nucleic acids research.

[6]  Ajay K. Royyuru,et al.  Systematic and Fully Automated Identification of Protein Sequence Patterns , 2000, J. Comput. Biol..

[7]  Anton J. Enright,et al.  GeneRAGE: a robust algorithm for sequence clustering and domain detection , 2000, Bioinform..

[8]  Douglas L. Brutlag,et al.  Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families , 1996, ISMB.

[9]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[10]  G Vriend,et al.  Receptors coupling to G proteins: Is there a signal behind the sequence? , 2000, Proteins.

[11]  B. Rost PHD: predicting one-dimensional protein structure by profile-based neural networks. , 1996, Methods in enzymology.

[12]  Jérôme Gracy,et al.  Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment , 1998, Bioinform..

[13]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[14]  R. K. Shyamasundar,et al.  Introduction to algorithms , 1996 .

[15]  F. Cohen,et al.  An evolutionary trace method defines binding surfaces common to protein families. , 1996, Journal of molecular biology.

[16]  Andrea Califano,et al.  Functional classification of proteins by pattern discovery and top-down clustering of primary sequences , 2001, IBM Syst. J..

[17]  E. Sonnhammer,et al.  Modular arrangement of proteins as inferred from analysis of homology , 1994, Protein science : a publication of the Protein Society.

[18]  D. Brutlag,et al.  Highly specific protein sequence motifs for genome analysis. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[19]  Jérôme Gracy,et al.  Automated protein sequence database classification. II. Delineation Of domain boundaries from sequence similarities , 1998, Bioinform..

[20]  David Haussler,et al.  Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families , 1993, ISMB.

[21]  Amir Dembo,et al.  Statistical Composition of High-Scoring Segments from Molecular Sequences , 1990 .

[22]  Erik L. L. Sonnhammer,et al.  A Hidden Markov Model for Predicting Transmembrane Helices in Protein Sequences , 1998, ISMB.

[23]  Lawrence Hunter,et al.  Computationally Efficient Cluster Representation in Molecular Sequence Megaclassification , 1993, ISMB.

[24]  Chris Sander,et al.  The HSSP database of protein structure-sequence alignments and family profiles , 1998, Nucleic Acids Res..

[25]  T. Attwood,et al.  PRINTS--a database of protein motif fingerprints. , 1994, Nucleic acids research.

[26]  Andrea Califano,et al.  SPLASH: structural pattern localization analysis by sequential histograms , 2000, Bioinform..

[27]  C. Sander,et al.  The HSSP database of protein structure-sequence alignments. , 1994, Nucleic acids research.

[28]  S. Watson,et al.  The G-Protein Linked Receptor Facts Book , 1994 .

[29]  Nathan Linial,et al.  A Map of the Protein Space: An Automatic Hierarchical Classification of all Protein Sequences , 1998, ISMB.

[30]  M. Nei,et al.  The neighbor-joining method , 1987 .

[31]  Gert Vriend,et al.  GPCRDB information system for G protein-coupled receptors , 2003, Nucleic Acids Res..

[32]  Amos Bairoch,et al.  The PROSITE dictionary of sites and patterns in proteins, its current status , 1993, Nucleic Acids Res..

[33]  C. Sander,et al.  A method to predict functional residues in proteins , 1995, Nature Structural Biology.

[34]  Amir Dembo,et al.  Strong limit theorems of empirical functionals for large exceedances of partial sums of i , 1991 .