Evaluating Autoencoder-Based Featurization and Supervised Learning for Protein Decoy Selection

Rapid growth in molecular structure data is renewing interest in featurizing structure. Featurizations that retain information on biological activity are particularly sought for protein molecules, where decades of research have shown that structure encodes function. Research on featurization of protein structure is active; here we assess the promise of autoencoders. Motivated by rapid progress in neural network research, we investigate and evaluate autoencoders for obtaining linear and nonlinear featurizations of protein tertiary structures. We also focus on autoencoders because their architectures are versatile and because small changes to the architecture switch between linear and nonlinear features. Although open-source neural network libraries such as Keras, which we employ here, greatly facilitate constructing, training, and evaluating autoencoder architectures and conducting model search, autoencoders have not yet gained popularity in the structural biology community. Here we demonstrate their utility in a practical context. Employing autoencoder-based featurizations, we address the classic problem of decoy selection in protein structure prediction. Using off-the-shelf supervised learning methods, we demonstrate that the featurizations are meaningful and allow detecting active tertiary structures, opening avenues for further research.
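
To make the linear-versus-nonlinear point concrete, the following is a minimal sketch, not the implementation used in this work. It is written in plain NumPy rather than Keras so that it stays self-contained, and everything in it is an assumption made for illustration: the function name `train_autoencoder`, the single hidden layer, the tanh activation, the hyperparameters, and the synthetic matrix `X`, which stands in for a set of decoys described by flattened coordinates. The only difference between the two models is whether an activation is applied to the bottleneck layer.

```python
import numpy as np

def train_autoencoder(X, n_features, nonlinear, epochs=500, lr=0.01, seed=0):
    """Train a one-hidden-layer autoencoder with full-batch gradient descent.

    Returns an encode function (the featurization) and the list of
    per-epoch mean squared reconstruction errors.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, n_features))
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(scale=0.1, size=(n_features, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: the bottleneck H holds the features.
        Z = X @ W_enc + b_enc
        H = np.tanh(Z) if nonlinear else Z
        X_hat = H @ W_dec + b_dec
        losses.append(float(np.mean((X_hat - X) ** 2)))
        # Backward pass for the loss (1 / 2n) * sum of squared errors.
        err = (X_hat - X) / n
        grad_W_dec = H.T @ err
        grad_b_dec = err.sum(axis=0)
        dH = err @ W_dec.T
        dZ = dH * (1.0 - H ** 2) if nonlinear else dH  # tanh derivative
        grad_W_enc = X.T @ dZ
        grad_b_enc = dZ.sum(axis=0)
        # Gradient descent step.
        W_enc -= lr * grad_W_enc
        b_enc -= lr * grad_b_enc
        W_dec -= lr * grad_W_dec
        b_dec -= lr * grad_b_dec

    def encode(A):
        Z = A @ W_enc + b_enc
        return np.tanh(Z) if nonlinear else Z

    return encode, losses

# Synthetic stand-in data (hypothetical): 200 "decoys", 30 coordinates each,
# generated from 3 hidden factors plus noise, then standardized per column.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))
X = (X - X.mean(axis=0)) / X.std(axis=0)

linear_encode, linear_losses = train_autoencoder(X, 3, nonlinear=False)
tanh_encode, tanh_losses = train_autoencoder(X, 3, nonlinear=True)

print("linear features:", linear_encode(X).shape)
print("nonlinear features:", tanh_encode(X).shape)
```

In a setting like the one described above, the rows returned by `encode` would serve as the feature vectors passed to an off-the-shelf classifier. A linear autoencoder trained with squared error is known to recover the same subspace as principal component analysis, which makes it a natural baseline for the nonlinear variant.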
