Predicting protective bacterial antigens using random forest classifiers

Identifying protective antigens from bacterial pathogens is important for developing vaccines. Most computational methods for predicting protein antigenicity rely on sequence similarity between a query protein sequence and at least one known antigen. Such methods limit our ability to predict novel antigens (i.e., antigens that are not homologous to any known antigen). Therefore, there is an urgent need for alignment-free computational methods that reliably predict protective antigens. We evaluated the discriminative power of four different feature representations derived from amino acid composition, using three classification methods (Logistic Regression, Support Vector Machine, and Random Forest) on a cross-validation data set of 193 protective bacterial antigens and 193 non-antigenic bacterial proteins. Our results show that, with all four data representations, Random Forest classifiers consistently outperform the other classifiers. We compared HRF50, one of the best-performing Random Forest classifiers, with VaxiJen and SignalP on independent test sets derived from the Chlamydia trachomatis and Bartonella proteomes. Our results show that HRF50 outperforms VaxiJen and is competitive with SignalP and ANTIGENpro in predicting protective antigens. We further showed that combining SignalP with HRF50 yields a method, which we call BacGen, whose performance is comparable to or better than that of ANTIGENpro in predicting antigens in bacterial sequences. We conclude that features derived from amino acid sequence composition can be used effectively to design alignment-free methods for predicting protein antigenicity using Random Forest classifiers. BacGen is available as an online server at: http://ailab.cs.iastate.edu/bacgen/.
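As an illustration of what an alignment-free, composition-derived representation can look like, the following minimal Python sketch computes two common encodings: amino acid composition (20 frequencies) and dipeptide composition (400 frequencies). The abstract does not specify which four representations were evaluated, so these two encodings, the function names, and the handling of non-standard residues are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order that defines feature positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _standard_residues(sequence):
    """Upper-case the sequence and keep only the 20 standard residues.

    Dropping non-standard symbols (e.g. X, B, Z) is a simplifying assumption.
    """
    return [r for r in sequence.upper() if r in AMINO_ACIDS]


def amino_acid_composition(sequence):
    """Return 20 residue frequencies, ordered as in AMINO_ACIDS."""
    residues = _standard_residues(sequence)
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 frequencies of adjacent residue pairs (AA, AC, ..., YY)."""
    residues = _standard_residues(sequence)
    pairs = [a + b for a, b in zip(residues, residues[1:])]
    if not pairs:
        raise ValueError("sequence needs at least two standard amino acids")
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


if __name__ == "__main__":
    toy_sequence = "MKKLLA"  # hypothetical toy sequence, not a real antigen
    print(len(amino_acid_composition(toy_sequence)))
    print(len(dipeptide_composition(toy_sequence)))
```

Because every protein maps to a vector of the same length regardless of sequence length, vectors like these can be passed directly to a standard classifier such as a random forest, with no alignment against known antigens.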
