When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features

Sequence-derived structural and physicochemical features have been used to build models that predict protein families. Here, we test the hypothesis that high-level functional groups of proteins can be classified using a very small set of global features extracted directly from sequence alone. We represent each protein by a small number of normalized global sequence features and classify the proteins into functional groups using support vector machines (SVMs). We also investigate the contribution of specific feature subsets to classification quality. Global features provide effective information for protein family classification, with results comparable to those obtained using local sequence alignment scores. Moreover, combining global and local sequence features significantly improves classification performance.
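To make the idea of a "global" representation concrete, the following is a minimal sketch of how a protein sequence might be mapped to a short, fixed-length feature vector and then normalized across a dataset. The abstract does not list the actual features, so the choices here (sequence length, amino acid composition, hydrophobic fraction, net charge per residue), the residue groupings, and the function names `global_features` and `min_max_normalize` are illustrative assumptions, not the authors' feature set.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Illustrative residue groupings (assumptions, not taken from the paper).
HYDROPHOBIC = set("AVILMFWC")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def global_features(seq):
    """Return a fixed-length vector of global features for one sequence.

    Layout: [length, 20 composition fractions, hydrophobic fraction,
    net charge per residue] -> 23 values in total.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    composition = [counts[a] / n for a in AMINO_ACIDS]
    hydrophobic_frac = sum(counts[a] for a in HYDROPHOBIC) / n
    positive = sum(counts[a] for a in POSITIVE)
    negative = sum(counts[a] for a in NEGATIVE)
    net_charge = (positive - negative) / n
    return [float(n)] + composition + [hydrophobic_frac, net_charge]


def min_max_normalize(rows):
    """Scale every feature column to [0, 1] across the dataset.

    Columns with no variation are mapped to 0.0.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

The normalized vectors could then be passed to any SVM implementation as training input. Because every protein yields a vector of the same length regardless of sequence length, no alignment is needed to compare two proteins.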
