Mean Dimension of Generative Models for Protein Sequences

Generative models for protein sequences are important for protein design, mutational effect prediction, and structure prediction. In all of these tasks, the introduction of models that include interactions between pairs of positions has had a major impact over the last decade. More recently, many methods going beyond pairwise models have been developed, for example neural networks that can in principle capture interactions between more than two positions from multiple sequence alignments. However, little is known about the inter-dependency patterns between positions in these models, or about how important higher-order interactions involving more than two positions are for their performance. In this work, we introduce the notion of mean dimension for generative models for protein sequences, which measures the average number of positions involved in interactions, weighted by their contribution to the total variance in the log probability of the model. We estimate the mean dimension for different model classes trained on different protein families, relate it to the performance of the models on mutational effect prediction tasks, and trace its evolution during training. We find that the mean dimension is related to the performance of models in biological prediction tasks and can highlight differences between model classes even when their performance in the prediction task is similar. The overall low mean dimension indicates that well-performing models are not necessarily of high complexity and encourages further work on interpreting their performance in biological terms.
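To make the definition concrete, the following is a minimal sketch of the ANOVA mean dimension for a toy function on short sequences. It relies on the standard identity that, for independent inputs, the mean dimension equals the sum of the total Sobol' indices divided by the total variance. Several things here are assumptions for illustration and are not taken from the paper: positions are drawn independently and uniformly, the functions `additive` and `pairwise` stand in for a model's log probability, and the expectation is computed by exhaustive enumeration, which is only feasible for very short sequences. Real models would call for a sampling-based estimator instead.

```python
import itertools


def mean_dimension(f, length, alphabet_size):
    """Exact ANOVA mean dimension of f, assuming i.i.d. uniform positions.

    Uses sum_i tau_i^2 / sigma^2, where
    tau_i^2 = 1/2 * E[(f(x) - f(x with position i resampled))^2].
    """
    states = range(alphabet_size)
    seqs = list(itertools.product(states, repeat=length))
    values = {s: f(s) for s in seqs}

    mean = sum(values.values()) / len(seqs)
    variance = sum((v - mean) ** 2 for v in values.values()) / len(seqs)
    if variance == 0:
        raise ValueError("f is constant; the mean dimension is undefined")

    total = 0.0
    for i in range(length):
        acc = 0.0
        for s in seqs:
            for a in states:
                # Same sequence with position i replaced by state a.
                t = s[:i] + (a,) + s[i + 1:]
                acc += (values[s] - values[t]) ** 2
        total += 0.5 * acc / (len(seqs) * alphabet_size)
    return total / variance


# Hypothetical toy "log probabilities" on binary sequences.
FIELDS = [[0.0, 1.0], [0.5, -0.5], [2.0, 0.0]]


def additive(x):
    # Independent-site model: every position contributes on its own.
    return sum(FIELDS[i][xi] for i, xi in enumerate(x))


def pairwise(x):
    # Pure two-body coupling between positions 0 and 1, spins in {-1, +1}.
    return (2 * x[0] - 1) * (2 * x[1] - 1)


print(mean_dimension(additive, 3, 2))  # 1.0 up to floating-point rounding
print(mean_dimension(pairwise, 2, 2))  # 2.0
```

An independent-site function has mean dimension 1 and a pure two-body coupling has mean dimension 2. A function that mixes both kinds of terms lands between the two values, depending on how much variance each term carries.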
