A similarity distance of diversity measure for discriminating mesophilic and thermophilic proteins

The successful prediction of thermophilic proteins is useful for designing stable enzymes that remain functional at high temperature. We used the increment of diversity (ID), a novel amino acid composition-based similarity distance, in a two-class K-nearest neighbor (KNN) classifier to discriminate thermophilic from mesophilic proteins; we refer to the resulting classifier as KNN-ID. Instead of extracting features from protein sequences as done previously, our approach is based on a diversity measure of symbol sequences. The similarity distance between each pair of protein sequences is first calculated to quantify how similar one sequence is to the other. The class of the query protein is then assigned using the K-nearest neighbor algorithm. Comparisons with several recently published methods showed that KNN-ID outperforms them. The improved predictive performance indicates that it is a simple and effective classifier for discriminating thermophilic and mesophilic proteins. Finally, the influence of protein length and sequence identity on prediction accuracy is discussed. The prediction model and dataset used in this article can be freely downloaded from http://wlxy.imu.edu.cn/college/biostation/fuwu/KNN-ID/index.htm.
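The two steps described above (a composition-based ID distance followed by a nearest-neighbor vote) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the sketch assumes the commonly used definitions of the diversity measure, D(X) = N ln N − Σ nᵢ ln nᵢ over amino acid counts nᵢ with N = Σ nᵢ, and of the increment of diversity, ID(X, Y) = D(X + Y) − D(X) − D(Y). The function names, the toy sequences, the choice of k, and the simple majority vote are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Count each standard amino acid in a sequence (other symbols are ignored)."""
    counts = Counter(seq.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def diversity(counts):
    """Assumed diversity measure: D = N ln N - sum(n_i ln n_i)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """Assumed ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller means more similar."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


def knn_id_predict(query, training, k=3):
    """Label a query by majority vote among the k training sequences with the smallest ID."""
    q = composition(query)
    scored = sorted(
        (increment_of_diversity(q, composition(seq)), label)
        for seq, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration (not real proteins).
training = [
    ("EEEKKKRRR", "thermophilic"),
    ("EEKKRRVV", "thermophilic"),
    ("QQQNNNSSS", "mesophilic"),
    ("QQNNSSTT", "mesophilic"),
]
prediction = knn_id_predict("EEKKRR", training, k=3)
print(prediction)
```

Under these assumed definitions, two sequences with identical amino acid proportions have an ID of zero regardless of their lengths, and the value grows as the compositions diverge, which is why it can stand in for a distance in the nearest-neighbor step.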
