Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies

Background: Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, and these scores cannot be used as objective distance metrics. As a result, one either relies on crude measures, such as the p- or log-det distances, or makes explicit and often overly simplistic a priori assumptions about sequence evolution. Information theory provides an alternative in the form of mutual information (MI). MI is, in principle, an objective and model-independent similarity measure, but it is not widely used in this context, and no algorithm is known for extracting MI from a given alignment without assuming an evolutionary model. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, even though the normalized compression distance based on it has shown promising results.

Results: We describe a simple approach for obtaining robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA, our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to those obtained from the alignment-free methods mentioned above. We point out that, because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, and we propose a simple modification that overcomes the additivity problem. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures such as the Kimura or log-det (resp. paralinear) distances.

Conclusions: Several versions of MI-based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single-letter Shannon entropies, which can easily be incorporated into existing software packages, gave superior results throughout the entire animal kingdom. We see the main virtue of our approach, however, as more general. For example, it can also help to judge the relative merits of different alignment algorithms by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.
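To make the two kinds of MI estimate mentioned above concrete, the following Python sketch shows one possible reading of each. It is not the authors' implementation. The function names are hypothetical, gaps are treated as an ordinary fifth symbol (an assumption), and the general-purpose `zlib` compressor stands in for the specialized DNA compressors that a serious estimate would need.

```python
import math
import zlib
from collections import Counter


def alignment_mutual_information(row_a: str, row_b: str) -> float:
    """Plug-in single-letter Shannon MI, in bits per alignment column.

    row_a and row_b are the two rows of a global pairwise alignment.
    Assumption: the gap character '-' is counted like any other symbol.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    n = len(row_a)
    if n == 0:
        return 0.0
    pair_counts = Counter(zip(row_a, row_b))  # joint symbol counts per column
    a_counts = Counter(row_a)                 # marginal counts, first row
    b_counts = Counter(row_b)                 # marginal counts, second row
    mi = 0.0
    for (x, y), c in pair_counts.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (a_counts[x] * b_counts[y]))
    return mi


def compressed_bits(seq: str) -> int:
    """Length in bits of the zlib-compressed sequence (a crude complexity proxy)."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def compression_mutual_information(seq_a: str, seq_b: str) -> int:
    """Alignment-free MI proxy: C(A) + C(B) - C(AB), in bits.

    Assumption: zlib is only a stand-in compressor; its 32 KB window and
    fixed overhead make this estimate rough, especially for long genomes.
    """
    return (compressed_bits(seq_a) + compressed_bits(seq_b)
            - compressed_bits(seq_a + seq_b))
```

For two identical rows with uniform base composition, `alignment_mutual_information` returns the full single-letter entropy of 2 bits per column, and for rows whose columns are statistically independent it returns 0. The compression-based proxy has no such clean limits, which mirrors the abstract's remark that compression estimates come with uncontrolled errors.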

[1]  Trevor I. Dix,et al.  A Simple Statistical Algorithm for Biological Sequence Compression , 2007, 2007 Data Compression Conference (DCC'07).

[2]  Alexander Kraskov,et al.  MIC: Mutual Information Based Hierarchical Clustering , 2008, 0809.1605.

[3]  M. Steel,et al.  Recovering evolutionary trees under a more realistic model of sequence evolution. , 1994, Molecular biology and evolution.

[4]  Wei Zhang,et al.  Random local neighbor joining: a new method for reconstructing phylogenetic trees. , 2008, Molecular phylogenetics and evolution.

[5]  M. Stanhope,et al.  Molecular systematics of armadillos (Xenarthra, Dasypodidae): contribution of maximum likelihood and Bayesian analyses of mitochondrial and nuclear genes. , 2003, Molecular phylogenetics and evolution.

[6]  J. E. Glynn,et al.  Numerical Recipes: The Art of Scientific Computing , 1989 .

[7]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.

[8]  Paul M. B. Vitányi,et al.  Clustering by compression , 2003, IEEE Transactions on Information Theory.

[9]  Inna Dubchak,et al.  Glocal alignment: finding rearrangements during alignment , 2003, ISMB.

[10]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[11]  Alexander Kraskov,et al.  Hierarchical Clustering Based on Mutual Information , 2003, ArXiv.

[12]  宁北芳,et al.  疟原虫var基因转换速率变化导致抗原变异[英]/Paul H, Robert P, Christodoulou Z, et al//Proc Natl Acad Sci U S A , 2005 .

[13]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[14]  William H. Press,et al.  Numerical recipes , 1990 .

[15]  A. Oskooi Molecular Evolution and Phylogenetics , 2008 .

[16]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[17]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[18]  Xin Chen,et al.  An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[19]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[20]  Aleksandar Milosavljevic,et al.  Discovering dependencies via algorithmic mutual information: A case study in DNA sequence comparisons , 1995, Machine Learning.

[21]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[22]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[23]  김삼묘,et al.  “Bioinformatics” 특집을 내면서 , 2000 .

[24]  Journal of Systematic Palaeontology , 2010 .

[25]  Gapped BLAST and PSI-BLAST: A new , 1997 .

[26]  J. Lake,et al.  Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[27]  Thomas L. Madden,et al.  BLAST: at the core of a powerful and diverse set of sequence analysis tools , 2004, Nucleic Acids Res..

[28]  Jean-Paul Delahaye,et al.  Transformation distances: a family of dissimilarity measures based on movements of segments , 1999, Bioinform..

[29]  Srinivas Aluru,et al.  Handbook Of Computational Molecular Biology , 2010 .

[30]  G. Steyskal Systematic Entomology , 1976 .

[31]  P. Buneman A Note on the Metric Properties of Trees , 1974 .

[32]  Lior Pachter,et al.  MAVID multiple alignment server , 2003, Nucleic Acids Res..

[33]  Paul A. Viola,et al.  Alignment by Maximization of Mutual Information , 1997, International Journal of Computer Vision.

[34]  G. Barlow,et al.  Fishes of the world , 2004, Environmental Biology of Fishes.

[35]  Erik L. L. Sonnhammer,et al.  Kalign – an accurate and fast multiple sequence alignment algorithm , 2005, BMC Bioinformatics.

[36]  M. A. Steel,et al.  Confidence in evolutionary trees from biological sequence data , 1993, Nature.

[37]  D. Saad Europhysics Letters , 1997 .

[38]  A. Wyner,et al.  Analysis and Optimization of Systems , 1988 .

[39]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[40]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[41]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[42]  Jorma Rissanen,et al.  Stochastic Complexity and Statistical Inference , 1986 .

[43]  Bernard S. Wostmann,et al.  Panel of referees , 2007 .

[44]  Trevor I. Dix,et al.  Compression and Approximate Matching , 1999, Comput. J..

[45]  James A. Storer,et al.  DATA COMPRESSION CONFERENCE , 2001 .