Bacterial Immunogenicity Prediction by Machine Learning Methods

The identification of protective immunogens is the most important and demanding initial step in the long and expensive process of vaccine design and development. Machine learning (ML) methods are very effective in data mining and in the analysis of big data such as microbial proteomes, and they can significantly reduce the experimental work needed to discover novel vaccine candidates. Here, we applied six supervised ML methods (partial least squares-based discriminant analysis, k nearest neighbor (kNN), random forest (RF), support vector machine (SVM), random subspace method (RSM), and extreme gradient boosting (xgboost)) to a set of 317 known bacterial immunogens and 317 bacterial non-immunogens and derived models for immunogenicity prediction. The models were validated internally by 10-fold cross-validation on the training set and externally on a test set. All of them showed good predictive ability. The xgboost model was the best at identifying immunogens, recognizing 84% of the known immunogens in the test set, while the combined RSM-kNN model was the best at recognizing non-immunogens, identifying 92% of them in the test set. The three best-performing ML models (xgboost, RSM-kNN, and RF) were implemented in the new version of the VaxiJen server, where the prediction of bacterial immunogens is now based on majority voting.
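As a minimal sketch of the majority-voting idea, the Python below combines binary predictions from three models and computes the fraction of immunogens and of non-immunogens recognized (sensitivity and specificity). The function names, the toy labels, and the per-model predictions are hypothetical and chosen only for illustration; the abstract does not describe how the server implements voting, so this should not be read as the VaxiJen code.

```python
from typing import List, Sequence, Tuple


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary calls (1 = immunogen, 0 = non-immunogen) per protein.

    `predictions` holds one sequence of calls per model. An odd number of
    models is required so that a tie cannot occur.
    """
    n_models = len(predictions)
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    # A protein is called an immunogen when more than half the models agree.
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (fraction of immunogens recognized, fraction of non-immunogens recognized).

    Assumes both classes are present in `y_true`.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: three immunogens followed by three non-immunogens.
y_true = [1, 1, 1, 0, 0, 0]
xgb_calls = [1, 1, 0, 0, 1, 0]      # stand-in for the xgboost model
rsm_knn_calls = [1, 0, 0, 0, 0, 0]  # stand-in for the RSM-kNN model
rf_calls = [1, 1, 1, 0, 1, 0]       # stand-in for the RF model

consensus = majority_vote([xgb_calls, rsm_knn_calls, rf_calls])
sensitivity, specificity = sensitivity_specificity(y_true, consensus)
print(consensus, sensitivity, specificity)
```

With three voters, a protein is labelled an immunogen whenever at least two models agree, so a model that is strong on immunogens and one that is strong on non-immunogens can offset each other's errors.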
