Prediction of peptide binding to MHC using machine learning with sequence and structure-based feature sets.

Selecting peptides that bind strongly to the major histocompatibility complex (MHC) for inclusion in a vaccine has therapeutic potential against infections and tumors. Machine learning models trained on sequence data already exist for predicting peptide:MHC (p:MHC) binding. Here, we train support vector machine classifier (SVMC) models on physicochemical sequence-based and structure-based descriptor sets to predict peptide binding to a well-studied model mouse MHC I allele, H-2Db. We also performed recursive feature elimination and two-way forward feature selection. Although less sensitive than current state-of-the-art algorithms, models based on physicochemical descriptor sets achieve specificity and precision comparable to those of the most popular sequence-based algorithms. The best-performing model uses a hybrid descriptor set containing both sequence-based and structure-based descriptors. Interestingly, close to half of the physicochemical sequence-based descriptors remaining in the hybrid model are properties of the anchor positions, residues 5 and 9 of the peptide sequence. In contrast, the residues flanking position 5 make little to no residue-specific contribution to the binding affinity prediction. These results suggest that machine-learned models incorporating both sequence-based descriptors and structural data may provide information on the specific physicochemical properties that determine binding affinities.
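The workflow summarized in the abstract (an SVM classifier, recursive feature elimination, and evaluation by sensitivity, specificity, and precision) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic stand-ins generated with scikit-learn, and the linear kernel, the 30 input features, the 10 retained features, and the helper name `binding_metrics` are all assumptions chosen for illustration. The two-way forward feature selection step is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def binding_metrics(y_true, y_pred):
    """Sensitivity, specificity, and precision for binder (1) vs non-binder (0)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    def ratio(num, den):
        # Guard against empty classes so the sketch never divides by zero.
        return float(num) / float(den) if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),  # fraction of true binders recovered
        "specificity": ratio(tn, tn + fp),  # fraction of non-binders rejected
        "precision": ratio(tp, tp + fp),    # fraction of predicted binders that bind
    }


# Synthetic stand-in for a descriptor matrix (assumption: 30 descriptors per
# peptide). Real inputs would be sequence- and structure-based descriptors.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=6, n_redundant=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RFE needs per-feature weights, so a linear-kernel SVC is assumed here.
model = make_pipeline(
    StandardScaler(),
    RFE(SVC(kernel="linear", C=1.0), n_features_to_select=10, step=1),
    SVC(kernel="linear", C=1.0),
)
model.fit(X_train, y_train)

# Indices of the descriptors that survived elimination.
kept = np.flatnonzero(model.named_steps["rfe"].support_)
metrics = binding_metrics(y_test, model.predict(X_test))

print("retained descriptor indices:", kept.tolist())
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With real descriptors, mapping the retained indices back to descriptor names would show which peptide positions dominate the model, which is the kind of analysis behind the observation about anchor residues 5 and 9.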
