3D deep convolutional neural networks for amino acid environment similarity analysis

Background

Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Most current methods rely on features that are manually selected based on knowledge about protein structures. These are often general-purpose but not optimized for the specific application of interest.

In this paper, we present a general framework that applies 3D convolutional neural network (3DCNN) technology to structure-based protein analysis. The framework automatically extracts task-specific features from the raw atom distribution, driven by supervised labels. As a pilot study, we use our network to analyze local protein microenvironments surrounding the 20 amino acids, and predict the amino acids most compatible with environments within a protein structure. To further validate the power of our method, we construct two amino acid substitution matrices from the prediction statistics and use them to predict effects of mutations in T4 lysozyme structures.

Results

Our deep 3DCNN achieves a two-fold increase in prediction accuracy compared to models that employ conventional hand-engineered features and successfully recapitulates known information about similar and different microenvironments. Models built from our predictions and substitution matrices achieve 85% accuracy in predicting outcomes of the T4 lysozyme mutation variants. Our substitution matrices contain rich information relevant to mutation analysis compared to well-established substitution matrices. Finally, we present a visualization method to inspect the individual contributions of each atom to the classification decisions.

Conclusions

End-to-end trained deep learning networks consistently outperform methods using hand-engineered features, suggesting that the 3DCNN framework is well suited for analysis of protein microenvironments and may be useful for other protein structural analyses.
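To make the idea of learning from the "raw atom distribution" concrete, the following is a minimal sketch of how atoms around a site might be binned into a multi-channel 3D grid before being passed to a 3D convolutional network. It is an illustration only, not the authors' implementation: the function name `voxelize`, the `CHANNELS` element-to-channel mapping, the 20 Å box, and the 1 Å voxel size are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical channel layout: one occupancy channel per element type (assumption).
CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}


def voxelize(coords, elements, center, box_size=20.0, voxel_size=1.0):
    """Bin atoms into a (channels, n, n, n) occupancy grid centered on `center`.

    box_size and voxel_size are illustrative defaults, not values from the paper.
    """
    n = int(round(box_size / voxel_size))
    grid = np.zeros((len(CHANNELS), n, n, n), dtype=np.float32)

    # Shift coordinates so the lower corner of the box sits at the origin,
    # then convert each position to integer voxel indices.
    lower_corner = np.asarray(center, dtype=float) - box_size / 2.0
    shifted = np.asarray(coords, dtype=float) - lower_corner
    idx = np.floor(shifted / voxel_size).astype(int)

    for (i, j, k), element in zip(idx, elements):
        channel = CHANNELS.get(element)
        if channel is None:
            continue  # skip element types without a channel
        if 0 <= i < n and 0 <= j < n and 0 <= k < n:
            grid[channel, i, j, k] = 1.0  # mark the voxel as occupied
    return grid


# Toy example: two atoms near the center and one far outside the box.
coords = [[0.2, 0.3, 0.1], [1.5, 0.0, 0.0], [50.0, 50.0, 50.0]]
elements = ["C", "O", "N"]
grid = voxelize(coords, elements, center=[0.0, 0.0, 0.0])
print(grid.shape, grid.sum())
```

A grid of this kind keeps the spatial arrangement of atoms intact, which is what lets convolutional filters learn local structural patterns directly rather than relying on manually selected features.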
