On the Entropy of Protein Families

Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families with several approaches, including the Hidden Markov Models used for protein databases and inferred statistical models that reproduce the low-order (1- and 2-point) statistics of multiple sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the mutation of one particular amino acid at a specific site, and relate this notion to the escape probability of HIV. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, arising in that case from the competition between different folds. Finally, we stress the relevance of the entropy to directed evolution experiments.
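As a rough illustration of the two quantities named above, the following sketch computes the entropy of a toy alignment under an independent-site model, that is, a model reproducing only the 1-point statistics, and the entropic cost of fixing one site. This is a deliberate simplification and not the Hidden Markov Model or pairwise inference used in the paper; the function names and the toy alignment are hypothetical. In an independent-site model the sites decouple, so fixing the amino acid at one site removes exactly that site's entropy.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (in nats) of each column of a list of aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        # Empirical frequencies, no pseudocount: a real alignment would need one.
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


def family_entropy(alignment):
    """Entropy of the independent-site model: the sum of the column entropies."""
    return sum(site_entropies(alignment))


def entropic_cost_of_fixing_site(alignment, site):
    """Entropy lost when `site` is constrained to a single amino acid.

    Assumption: sites are independent, so the cost equals that column's entropy
    regardless of which amino acid is imposed.
    """
    return site_entropies(alignment)[site]


# Hypothetical toy alignment: columns 0 and 2 vary, columns 1 and 3 are conserved.
toy_alignment = ["ACDA", "ACEA", "GCDA", "GCEA"]
total_entropy = family_entropy(toy_alignment)
cost_variable = entropic_cost_of_fixing_site(toy_alignment, 0)
cost_conserved = entropic_cost_of_fixing_site(toy_alignment, 1)
```

Under this simplification, constraining a conserved site costs no entropy while constraining a variable site does. Models that include 2-point statistics would additionally account for how a constraint at one site reduces the variability of the sites coupled to it.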
