Using genetic algorithms to select most predictive protein features

Many important characteristics of proteins, such as biochemical activity and subcellular localization, present a challenge to machine‐learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine‐learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence‐derived features and to correlate those features with the desired output. We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the coexistence of residue‐based features. It is this combination of individual features, that is, the step from the fractions of serine and of buried residues (input space 20 + 2) to the fraction of buried serine (input space 20 × 2), that implicitly shifts the search space from global feature inputs to features that can capture very local evidence, such as the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm (GA) paired with neural networks and support vector machines. We find that the GA is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to prediction of protein enzymatic activity. Proteins 2009. © 2008 Wiley‐Liss, Inc.
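The step from independent fractions (20 + 2 inputs) to coexistence fractions (20 × 2 inputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-state `buried`/`exposed` annotation, and the toy sequence are assumptions chosen only to make the counting concrete.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed two-state per-residue annotation (hypothetical labels).
STATES = ("buried", "exposed")


def global_features(sequence, states):
    """Independent fractions: 20 residue types + 2 states = 22 features."""
    n = len(sequence)
    residue_counts = Counter(sequence)
    state_counts = Counter(states)
    features = {aa: residue_counts[aa] / n for aa in AMINO_ACIDS}
    features.update({s: state_counts[s] / n for s in STATES})
    return features


def combined_features(sequence, states):
    """Coexistence fractions: 20 residue types x 2 states = 40 features."""
    n = len(sequence)
    # Count how often each (residue, state) pair occurs at the same position.
    pair_counts = Counter(zip(sequence, states))
    return {
        (aa, s): pair_counts[(aa, s)] / n
        for aa in AMINO_ACIDS
        for s in STATES
    }


# Toy example (made-up sequence and annotation, for illustration only).
sequence = "MSSAS"
states = ["exposed", "buried", "exposed", "buried", "buried"]

g = global_features(sequence, states)
c = combined_features(sequence, states)

print(len(g), len(c))         # number of global vs. combined features
print(g["S"], g["buried"])    # fraction of serine, fraction of buried residues
print(c[("S", "buried")])     # fraction of residues that are buried serine
```

The combined feature distinguishes a buried serine from a protein that merely contains many serines and many buried residues, which the two independent fractions cannot do. Combining more than two per-residue properties in the same way multiplies the number of candidate features, which is why a search strategy such as a GA is used rather than exhaustive enumeration.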
