Information Content of Sets of Biological Sequences Revisited

To analyze the information contained in a pool of amino acid sequences, a first approach is to align the sequences, to estimate the probability that each amino acid occurs within each column of the alignment, and to combine these values through an “entropy” function whose minimum corresponds to the absence of information, that is, to the case where every amino acid has the same probability of occurring. An alternative is to construct a distance tree between the sequences (derived from the alignment) based on sequence similarity and to interpret the tree topology properly so as to model the evolutionary property of residue conservation. We introduce the concept of “evolutionary content” of a tree of sequences, and demonstrate to what extent the more classical notion of “information content” of sequences approximates the new measure and in what manner tree topology contributes sharper information for the detection of protein binding sites.
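
The column-based approach described above can be illustrated with a minimal sketch. It assumes one common formulation, in which the information content of a column is log2(20) minus the Shannon entropy of the observed residue frequencies, so that the value is zero when all twenty amino acids are equally likely. This formulation, the function names, the handling of gaps and the convention for empty columns are assumptions made for illustration; they are not taken from the paper, and the tree-based “evolutionary content” is not reproduced here.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_information(column, alphabet=AMINO_ACIDS):
    """Information content (in bits) of one alignment column.

    Computed as log2(len(alphabet)) minus the Shannon entropy of the
    observed residue frequencies. Gaps and unknown symbols are ignored
    (an assumption of this sketch).
    """
    residues = [c for c in column if c in alphabet]
    if not residues:
        # Convention chosen for this sketch: an empty column carries no information.
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy


def alignment_information(sequences):
    """Per-column information content for aligned sequences of equal length."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_information(col) for col in zip(*sequences)]


if __name__ == "__main__":
    # Toy alignment: column 1 is fully conserved, column 3 is fully variable.
    toy_alignment = ["ACD", "ACE", "AGF"]
    for index, bits in enumerate(alignment_information(toy_alignment), start=1):
        print(f"column {index}: {bits:.3f} bits")
```

A fully conserved column reaches the maximum of log2(20) bits, while a column in which every amino acid is equally frequent scores zero, which matches the "absence of information" case in the text. Raw frequencies from a small alignment are noisy estimates of the underlying probabilities, so a practical implementation would likely add pseudocounts or sequence weighting.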
