A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins

Background: Members of the phylum Proteobacteria are the most prominent of the bacteria that cause plant diseases, which reduce the quantity and quality of food produced by agriculture. To limit these losses, infections need to be identified at an early stage. Recent developments in next-generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem, but none has emerged as the best protocol. Here we attempt a systematic way to determine the organismal origin of peptides using a machine learning algorithm, the Support Vector Machine (SVM).

Results: The amino acid compositions of proteobacterial proteins were found to differ from those of plant proteins. We developed SVM models based on amino acid and dipeptide compositions to distinguish a proteobacterial protein from a plant protein. The amino acid composition (AAC) based SVM model had an accuracy of 92.44% with a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC) based SVM model had a maximum accuracy of 94.67% with an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% with an MCC of 0.90. The models were tested on datasets not used in training to assess their validity.

Conclusion: The results indicate that an SVM based on the hybrid AAC and DC approach can be used to distinguish proteobacterial from plant protein sequences.
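The feature types named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how compositions are normalized or how non-standard residues are treated, so the percent scaling, the skipping of non-standard residues, and the function names (`aac`, `dc`, `hybrid`, `mcc`) are all assumptions made here. AAC gives a 20-dimensional vector, DC a 400-dimensional vector, and the hybrid is taken to be their concatenation (420 dimensions). The MCC function follows the standard confusion-matrix definition.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids (one-letter codes) and all 400 ordered pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: percent of each standard residue (20 values).

    Assumption: non-standard residues (e.g. X, B, Z) are ignored.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(a) / n for a in AMINO_ACIDS]


def dc(seq):
    """Dipeptide composition: percent of each overlapping residue pair (400 values).

    Assumption: pairs containing a non-standard residue are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def hybrid(seq):
    """Hybrid feature vector, assumed here to be AAC followed by DC (420 values)."""
    return aac(seq) + dc(seq)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    example = "MKVLAAGIVGLLLAQPSFA"  # hypothetical peptide, for illustration only
    print(len(aac(example)), len(dc(example)), len(hybrid(example)))
```

Vectors like these could then be passed, one per protein with a class label, to any SVM implementation; the kernel and its parameters are not given in the abstract and would need to be chosen by cross-validation.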
