Simple alignment-free methods for protein classification: a case study from G-protein-coupled receptors.

Computational methods for predicting protein function rely on detecting similarities among proteins. However, sufficient sequence information is not available for every protein family. For example, proteins of interest may be new members of a divergent protein family. The performance of protein classification methods can vary in such challenging situations. Using the G-protein-coupled receptor superfamily as an example, we investigated the performance of several protein classifiers. Alignment-free classifiers based on support vector machines using simple amino acid compositions were effective in remote-similarity detection, even from short, fragmented sequences. Although computationally expensive, a support vector machine classifier using local pairwise alignment scores showed very good, balanced performance. The more commonly used profile hidden Markov models were generally highly specific and well suited to classifying well-established protein family members. We suggest that different types of protein classifiers be applied together to gain the optimal mining power.
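
As a rough illustration of the "simple amino acid composition" descriptor mentioned above, the following minimal Python sketch turns a sequence into a 20-dimensional frequency vector of the kind that could be passed to a support vector machine. The function name `amino_acid_composition`, the residue ordering, and the example fragment are assumptions made for illustration; the abstract does not specify how the study computed or scaled its features.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order (assumed ordering, for illustration).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid in a sequence.

    Non-standard symbols (e.g. X, gaps) are ignored. No alignment is needed,
    so the same descriptor can be computed for short fragments.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Illustrative short fragment, not data from the study.
fragment = "MNGTEGPNFYVPFSNKTGVV"
features = amino_acid_composition(fragment)
```

Because the vector has the same length for any input, full-length proteins and short fragments can share one feature space, which is one plausible reason such descriptors remain usable when alignments are unreliable.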
