Hierarchical clustering based upon contextual alignment of proteins: a different way to approach phylogeny.

We present a computational study that applies a new approach to the analysis of protein sequences. The contextual alignment model, recently proposed by Gambin et al. (2002), assumes that the score for substituting one residue by another depends on the surrounding residues. We used the contextual alignment scores calculated in this model for hierarchical clustering of several protein families from the Clusters of Orthologous Groups (COG) database. For comparison, the same families were also clustered using the standard alignment approach. The comparative analysis shows that the contextual model yields more consistent clustering trees. The difference, although small, is without exception in favour of the contextual model. The consistency of each family of trees is measured by several consensus and agreement methods, as well as by inter-tree distances.
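To make the pipeline concrete, the following is a minimal Python sketch of the two ideas in the abstract: a global alignment whose substitution score depends on the neighbours of the substituted residue, and a hierarchical clustering built from the resulting scores. Everything specific here is an assumption for illustration only: the function names, the toy scoring rule `toy_context_score`, the linear gap penalty, the score-to-distance conversion, and the choice of average linkage are hypothetical and are not taken from Gambin et al. (2002) or from this study.

```python
from itertools import combinations

GAP = -2  # hypothetical linear gap penalty


def toy_context_score(left, a, right, b):
    """Hypothetical score for substituting a by b, given a's neighbours in its own sequence."""
    base = 2 if a == b else -1
    # Toy context rule (an assumption): a mismatch is penalised less
    # when the residue is flanked by two identical neighbours.
    if a != b and left is not None and left == right:
        base += 1
    return base


def contextual_align_score(s, t, score=toy_context_score, gap=GAP):
    """Global alignment score of s against t with a context-dependent substitution score."""
    n, m = len(s), len(t)
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m
        left = s[i - 2] if i >= 2 else None   # neighbour before s[i-1]
        right = s[i] if i < n else None       # neighbour after s[i-1]
        for j in range(1, m + 1):
            sub = prev[j - 1] + score(left, s[i - 1], right, t[j - 1])
            cur[j] = max(sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]


def pairwise_distances(seqs, align=contextual_align_score):
    """Turn alignment scores into dissimilarities (one simple choice among many)."""
    dist = {}
    for a, b in combinations(seqs, 2):
        self_avg = (align(seqs[a], seqs[a]) + align(seqs[b], seqs[b])) / 2
        # The contextual score need not be symmetric, so average both directions.
        cross_avg = (align(seqs[a], seqs[b]) + align(seqs[b], seqs[a])) / 2
        dist[frozenset((a, b))] = self_avg - cross_avg
    return dist


def average_linkage(names, dist):
    """Agglomerative clustering with average linkage; returns a nested-tuple tree."""
    clusters = [(name, [name]) for name in names]  # (subtree, member names)

    def avg(pair):
        (_, xs), (_, ys) = pair
        total = sum(dist[frozenset((x, y))] for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Usage with three made-up sequences (not COG data).
seqs = {"p1": "ACDEF", "p2": "ACDEG", "p3": "WWYWW"}
tree = average_linkage(list(seqs), pairwise_distances(seqs))
print(tree)
```

Swapping `toy_context_score` for a context-free function that ignores `left` and `right` gives the standard alignment, so the same code can build both trees for a family and the two can then be compared with a consensus or inter-tree distance measure.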

[1]  Charles Semple,et al.  A supertree method for rooted trees , 2000, Discret. Appl. Math..

[2]  A. von Haeseler,et al.  A stochastic model for the evolution of autocorrelated DNA sequences. , 1994, Molecular phylogenetics and evolution.

[3]  B. Baum Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees , 1992 .

[4]  Arne Elofsson Bioinformatics: From nucleic acids and proteins to cell metabolism: Edited by D. Schomburg and U. Lessel, VCH; Weinheim-New York, 1995. viii + 195 pp. DM 148.00 (hb). ISBN 3-527-30072-4 , 1996 .

[5]  Martin Vingron,et al.  Statistical Significance of Local Alignments with Gaps , 2007 .

[6]  Hervé Philippe,et al.  Phylogeny: A non-hyperthermophilic ancestor for Bacteria , 2002, Nature.

[7]  Anna Gambin Substitution Matrices for Contextual Alignment , 2002 .

[8]  J. Doyle,et al.  Gene Trees and Species Trees: Molecular Systematics as One-Character Taxonomy , 1992 .

[9]  P. Pardalos,et al.  Handbook of Combinatorial Optimization , 1998 .

[10]  Louis J. Billera,et al.  Geometry of the Space of Phylogenetic Trees , 2001, Adv. Appl. Math..

[11]  D. Lipman,et al.  A genomic perspective on protein families. , 1997, Science.

[12]  Jerzy Tiuryn,et al.  Contextual alignment of biological sequences , 2002, ECCB.

[13]  C. Kurland Something for everyone , 2000, EMBO reports.

[14]  Joseph L. Thorley,et al.  Cladistic Information, Leaf Stability And Supertree Construction , 2000 .

[15]  J L Risler,et al.  Phylogeny of related functions: the case of polyamine biosynthetic enzymes. , 2000, Microbiology.

[16]  J. L. Jensen,et al.  Probabilistic models of DNA sequence evolution with context dependent rates of substitution , 2000, Advances in Applied Probability.

[17]  Chantal Korostensky,et al.  Optimal Scoring Matrices for Estimating Distances Between Aligned Sequences , 1999 .

[18]  Edward N. Adams III N-trees as nestings: Complexity, similarity, and consensus , 1986 .

[19]  Xin He,et al.  Computing Distances between Evolutionary Trees , 1998 .

[20]  M Steel,et al.  Simple but fundamental limitations on supertree and consensus tree methods. , 2000, Systematic biology.

[21]  Michael Y. Galperin,et al.  The COG database: new developments in phylogenetic classification of proteins from complete genomes , 2001, Nucleic Acids Res..

[24]  D. Sankoff,et al.  Comparative Genomics: "Empirical And Analytical Approaches To Gene Order Dynamics, Map Alignment And The Evolution Of Gene Families" , 2000 .

[25]  R F Doolittle,et al.  Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. , 1996, Methods in enzymology.

[26]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[27]  N Linial,et al.  Global self-organization of all known protein sequences reveals inherent biological signatures. , 1997, Journal of molecular biology.

[28]  P Argos,et al.  Oligopeptide biases in protein sequences and their use in predicting protein coding regions in nucleotide sequences , 1988, Proteins.

[29]  G Perrière,et al.  Bacterial molecular phylogeny using supertree approach. , 2001, Genome informatics. International Conference on Genome Informatics.

[30]  Mikkel Thorup,et al.  On the Agreement of Many Trees , 1995, Inf. Process. Lett..

[31]  Anna Gambin,et al.  Contextual Multiple Sequence Alignment , 2005, Journal of biomedicine & biotechnology.

[32]  Jerzy Tiuryn,et al.  Alignment with Context Dependent Scoring Function , 2006, J. Comput. Biol..

[33]  Emil Grosswald,et al.  The Theory of Partitions , 1984 .