Constructing a meaningful evolutionary average at the phylogenetic center of mass

Background

As a consequence of the evolutionary process, data collected from related species tend to be similar. This similarity by descent can obscure subtler signals in the data, such as evidence of constraint on variation due to shared selective pressures. In comparative sequence analysis, for example, sequence similarity is often used to illuminate important regions of the genome, but if the comparison is between closely related species, then similarity is the rule rather than the interesting exception. Worse, the contribution of a divergent third species may be masked by the strong similarity between the other two. Here we propose a remedy that weights the contribution of each species according to its phylogenetic placement.

Results

We first solve the problem of summarizing data related by phylogeny, and we explain why an average should operate on the entire evolutionary trajectory that relates the data. This perspective leads to a new approach in which we define the average in terms of the phylogeny, using the data and a stochastic model to obtain a probability on evolutionary trajectories. With the assumption that the data evolve according to a Brownian motion process on the tree, we show that our evolutionary average can be computed as a convex combination of the species data. Thus, our approach, called the BranchManager, defines both an average and a novel taxon weighting scheme. We compare the BranchManager to two other methods, demonstrating why it exhibits desirable properties. In doing so, we devise a framework for comparison and introduce the concept of a representative point at which the average is situated.

Conclusion

The BranchManager uses as its representative point the phylogenetic center of mass, a choice which has both intuitive and practical appeal. Because our average is intrinsic to both the dataset and the phylogeny, we expect it and its corresponding weighting scheme to be useful in many kinds of studies where interspecies data need to be combined. Obvious applications include evolutionary studies of morphology, physiology or behaviour, but quantitative measures such as sequence hydrophobicity and gene expression level are amenable to our approach as well. Other areas of potential impact include motif discovery and vaccine design. A Java implementation of the BranchManager is available for download, as is a script written in the statistical language R.
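The claim that a Brownian motion model yields an average expressible as a convex combination of the species data can be illustrated with a small sketch. The code below is not the BranchManager: it computes the simpler, standard generalized least squares estimate of the value at the root of the tree, so its representative point is the root rather than the phylogenetic center of mass. The function name `subtree_weights`, the nested-list tree encoding, and the toy tree and data values are all assumptions made for illustration.

```python
from fractions import Fraction


def subtree_weights(node):
    """Return (weights, variance) for the estimate at the top of `node`.

    A node is either a tip name (str) or a list of (child, branch_length)
    pairs. Branch lengths are assumed to be strictly positive.
    """
    if isinstance(node, str):
        # A tip is observed exactly: full weight on itself, zero variance.
        return {node: Fraction(1)}, Fraction(0)

    parts = []
    for child, length in node:
        child_weights, child_var = subtree_weights(child)
        # Under Brownian motion, variance grows linearly along the branch.
        precision = Fraction(1) / (child_var + Fraction(length))
        parts.append((child_weights, precision))

    total = sum(precision for _, precision in parts)
    weights = {}
    for child_weights, precision in parts:
        for name, value in child_weights.items():
            # Each child contributes in proportion to its precision.
            weights[name] = value * precision / total
    return weights, 1 / total


# Hypothetical tree ((A:1,B:1):1,C:2): A and B are close relatives,
# C is the divergent third species.
tree = [([("A", 1), ("B", 1)], 1), ("C", 2)]
data = {"A": 10, "B": 10, "C": 3}  # made-up trait values

weights, _ = subtree_weights(tree)
average = sum(weights[name] * data[name] for name in data)

for name in sorted(weights):
    print(name, weights[name])   # A 2/7, B 2/7, C 3/7
print("weighted average:", average)  # 7
```

In this toy case the weights are non-negative and sum to one, and the divergent species C receives more weight (3/7) than either A or B (2/7 each), so its contribution is not masked by the similarity between the two close relatives. An unweighted mean would instead give each species 1/3.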
