Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting

MOTIVATION: Identification of residues that account for protein function specificity is crucial, not only for understanding the nature of functional specificity, but also for protein engineering experiments aimed at switching the specificity of an enzyme, regulator or transporter. Available algorithms generally use multiple sequence alignments to identify residue positions that are conserved within subfamilies but divergent between them. However, many biological examples show a much subtler picture than simple intra-group conservation versus inter-group divergence.

RESULTS: We present multi-RELIEF, a novel approach for identifying specificity residues that is based on RELIEF, a state-of-the-art Machine-Learning technique for feature weighting. It estimates the expected 'local' functional specificity of residues from an alignment divided into multiple classes. Optionally, 3D structure information is exploited by increasing the weight of residues that have high-weight neighbors. Using ROC curves over a large body of experimental reference data, we show that (a) multi-RELIEF identifies specificity residues for the seven test sets used, (b) incorporating structural information improves prediction for specificity of interaction with small molecules and (c) comparison of multi-RELIEF with four other state-of-the-art algorithms indicates its robustness and best overall performance.

AVAILABILITY: A web-server implementation of multi-RELIEF is available at www.ibi.vu.nl/programs/multirelief. Matlab source code of the algorithm and data sets are available on request for academic use.
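
To illustrate the kind of feature weighting that RELIEF performs, the following is a minimal Python sketch of the basic two-class RELIEF update applied to alignment columns. It is not the authors' Matlab implementation: the function names, the use of Hamming distance, the single nearest hit and miss per sequence, and the toy alignment are all assumptions made for illustration. The multi-class treatment and the optional 3D neighbor weighting described above are not shown.

```python
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def relief_weights(seqs: Sequence[str], labels: Sequence[int]) -> List[float]:
    """Basic two-class RELIEF on alignment columns (illustrative sketch).

    For every sequence, find its nearest neighbor in the same class
    (the "hit") and in the other class (the "miss"). A column gains
    weight when the sequence differs from the miss and loses weight
    when it differs from the hit.
    """
    n_pos = len(seqs[0])
    weights = [0.0] * n_pos
    for i, (seq, label) in enumerate(zip(seqs, labels)):
        hits = [s for k, s in enumerate(seqs) if k != i and labels[k] == label]
        misses = [s for k, s in enumerate(seqs) if labels[k] != label]
        hit = min(hits, key=lambda s: hamming(seq, s))
        miss = min(misses, key=lambda s: hamming(seq, s))
        for j in range(n_pos):
            weights[j] += (seq[j] != miss[j]) - (seq[j] != hit[j])
    # Average over all sequences so weights lie in [-1, 1].
    return [w / len(seqs) for w in weights]


# Hypothetical toy alignment: columns 2 and 4 separate the two classes,
# column 1 is fully conserved, column 3 varies within each class.
toy_alignment = ["ACDK", "ACEK", "AGDR", "AGER"]
toy_labels = [0, 0, 1, 1]
weights = relief_weights(toy_alignment, toy_labels)
print(weights)
```

In this toy case the class-separating columns receive the highest weight, the fully conserved column receives zero, and the column that varies only within classes is penalized, which mirrors the intuition of conservation within subfamilies versus divergence between them.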
