Deep clustering of protein folding simulations

We examine the problem of clustering biomolecular simulations using deep learning techniques. Because biomolecular simulation datasets are inherently high dimensional, low dimensional representations are often needed to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes. In this paper, we use a convolutional variational autoencoder (CVAE) to learn low dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems: the Fs-peptide (14 μs aggregate sampling), the villin headpiece (a single 125 μs trajectory) and the mixed β-β-α (BBA) protein (223 + 102 μs of sampling across two independent trajectories). In these systems, we show that the latent features learned by the CVAE correspond to distinct conformational substates along the protein folding pathways. The CVAE model correctly predicts nearly 89% of all contacts within the folding trajectories, while extracting folded, unfolded and potentially misfolded states in an unsupervised manner. Further, latent features learned by the CVAE on one trajectory can be applied to other independent trajectories, which makes the model particularly attractive for identifying conformational substates that share similar structural features. Together, these results show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
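As a rough illustration of the quantities mentioned above, the following sketch builds a binary contact map from per-residue coordinates and scores a reconstructed map against it. The 8 Å cutoff, the one-point-per-residue representation, the entrywise accuracy measure and the function names `contact_map` and `contact_accuracy` are assumptions made for this example; the abstract does not specify how contacts are defined or how the 89% figure is computed.

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary contact map from an (N, 3) array with one point per residue.

    The cutoff (in angstroms) is an assumed value, not taken from the paper.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise displacement vectors via broadcasting: shape (N, N, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < cutoff).astype(np.uint8)

def contact_accuracy(true_map, predicted_probs, threshold=0.5):
    """Fraction of matrix entries where the thresholded prediction
    agrees with the reference contact map (an assumed metric)."""
    predicted = (np.asarray(predicted_probs) >= threshold).astype(np.uint8)
    return float((predicted == np.asarray(true_map)).mean())

# Toy input: five pseudo-residues spaced 3.8 angstroms apart on a line
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
cmap = contact_map(coords)

# A perfect "reconstruction" agrees with the reference everywhere
perfect = contact_accuracy(cmap, cmap.astype(float))
```

In a CVAE setting, maps like `cmap` (one per simulation frame) would serve as the image-like input, and `predicted_probs` would be the decoder output for the same frame.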
