Application of Intelligent Techniques for Classification of Bacteria Using Protein Sequence-Derived Features

Standard molecular experimental methodologies and mathematical procedures often fail to resolve many phylogeny and classification problems. Modern artificial intelligence-based techniques, such as radial basis functions, genetic algorithms, artificial neural networks, and support vector machines, hold considerable potential in this regard. Relying on a large number of essential parameters, rather than a single molecular parameter, should improve robustness, reliability, and accuracy. This study was conducted on a dataset of computed protein physicochemical properties belonging to 20 different bacterial genera. A total of 57 sequential and structural parameters derived from protein sequences were considered for the initial classification. Feature selection techniques were then employed to identify the features with the greatest influence on the dataset. The percentages of various amino acids, hydrophobicity, relative sulfur percentage, and codon number were selected as important parameters. Comparative analyses were performed using the RapidMiner data mining platform. The support vector machine proved to be the best method, with a maximum accuracy of more than 91%.
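To make "protein sequence-derived features" concrete, the following is a minimal Python sketch of how a few such descriptors might be computed from a single sequence. It is not the authors' pipeline (the study used RapidMiner, and the abstract does not list all 57 parameters). The function name `sequence_features` and the feature names are hypothetical, hydrophobicity is approximated here by the mean Kyte-Doolittle hydropathy (GRAVY), and the percentage of sulfur-containing residues (Cys and Met) is used only as an assumed stand-in for the paper's "relative sulfur percentage".

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return a dict of simple descriptors for one protein sequence.

    Hypothetical helper: the feature set is illustrative, not the 57
    parameters used in the study.
    """
    # Keep only standard residues; ambiguous codes such as X are skipped.
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")

    n = len(residues)
    counts = Counter(residues)

    # Percentage composition of each of the 20 amino acids.
    features = {
        f"pct_{aa}": 100.0 * counts[aa] / n for aa in sorted(KYTE_DOOLITTLE)
    }
    # Mean hydropathy (GRAVY) as a hydrophobicity proxy.
    features["gravy"] = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    # Assumed proxy for sulfur content: share of Cys and Met residues.
    features["pct_sulfur_residues"] = 100.0 * (counts["C"] + counts["M"]) / n
    features["length"] = n
    return features
```

Applied to every protein in a dataset, a function like this would yield one feature vector per sequence, which could then be passed to a feature selection step and a classifier such as a support vector machine.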
