Euclidian space and grouping of biological objects

MOTIVATION: Biological objects tend to cluster into discrete groups, and objects within a group typically possess similar properties. It is important to have fast and efficient grouping tools that yield biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for refining clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is no less important. Tree-building methods are typically used for such visualization. An alternative approach to visualization is a multidimensional sequence space. In this space, proteins are represented as points, and distances between the points reflect the relationships between the proteins. Such a space can also serve as a basis for model-based clustering strategies, which typically produce results that correlate better with the biological properties of proteins.

RESULTS: We developed an approach to classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure. We apply the methodology to amino acid sequences. In the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins, measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. In the second step, we find the best-fit approximation of the evolutionary distances by Euclidian distances and thus represent each protein by a point in a multidimensional space. The Euclidian space may be projected onto two or three dimensions, and the projections can be used to visualize relationships between proteins. In the third step, we find a non-parametric estimate of the probability density of the points and place the points that belong to the same local maximum of this density in one group. The number of groups is controlled by a sigma parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.
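The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the Poisson correction d = -ln(1 - p) stands in for the evolutionary distance estimate, classical (Torgerson) multidimensional scaling stands in for the best-fit Euclidian approximation, and a Gaussian kernel of width `sigma` with mean-shift hill climbing stands in for the non-parametric density estimate and the assignment of points to local maxima. The function names `poisson_distances`, `classical_mds` and `density_mode_clusters`, and the `merge_tol` threshold, are hypothetical.

```python
import numpy as np


def poisson_distances(seqs):
    """Step 1 (assumed model): pairwise distances d = -ln(1 - p),
    where p is the fraction of aligned sites that differ."""
    arr = np.array([list(s) for s in seqs])  # rows: sequences, cols: sites
    n = len(seqs)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            p = np.mean(arr[i] != arr[j])
            # Guard against log(0) for saturated pairs (illustrative cap).
            d[i, j] = d[j, i] = -np.log(1.0 - min(p, 0.99))
    return d


def classical_mds(d, dim=2):
    """Step 2 (assumed method): classical MDS embedding of a distance matrix."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]           # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def density_mode_clusters(x, sigma, n_iter=500, tol=1e-6):
    """Step 3 (assumed method): move every point uphill on a Gaussian kernel
    density estimate (mean shift) and group points that reach the same maximum."""
    pos = x.astype(float).copy()
    for _ in range(n_iter):
        sq = ((pos[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-sq / (2.0 * sigma ** 2))        # kernel weights
        new = (w @ x) / w.sum(axis=1, keepdims=True)
        shift = np.abs(new - pos).max()
        pos = new
        if shift < tol:
            break
    merge_tol = 0.1 * sigma  # hypothetical threshold for "same maximum"
    labels = np.full(len(x), -1, dtype=int)
    centers = []
    for i, p in enumerate(pos):
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < merge_tol:
                labels[i] = k
                break
        else:
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels
```

Given a list of equal-length aligned sequences `seqs`, the steps chain as `density_mode_clusters(classical_mds(poisson_distances(seqs), dim=3), sigma)`. In this sketch a small `sigma` produces a bumpy density with many maxima and therefore many groups, while a large `sigma` smooths the density toward a single maximum.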
