Automatic Discovery Using Genetic Programming of an Unknown-Sized Detector of Protein Motifs Containing Repeatedly-Used Subexpressions

Automated methods of machine learning may be useful in discovering biologically meaningful patterns that are hidden in the rapidly growing databases of genomic and protein sequences. However, almost all existing methods of automated discovery require that the user specify, in advance, the size and shape of the pattern that is to be discovered. Moreover, existing methods do not have a workable analog of the idea of a reusable subroutine to exploit the recurring subpatterns of a problem environment. Genetic programming can evolve complicated problem-solving expressions of unspecified size and shape. When automatically defined functions are added to genetic programming, it becomes capable of efficiently capturing and exploiting recurring subpatterns. This paper describes how genetic programming with automatically defined functions successfully evolved motifs for detecting the D-E-A-D box family of proteins and for detecting the manganese superoxide dismutase family. Both motifs were evolved without prespecifying their length. Both evolved motifs employed automatically defined functions to capture the repeated use of common subexpressions. When tested against the SWISS-PROT database of proteins, the two genetically evolved consensus motifs detect the two families as well as, or slightly better than, the comparable human-written motifs found in the PROSITE database.
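To make the idea of a repeatedly-used subexpression concrete, the following minimal Python sketch shows a motif detector in which one reusable function plays the role of an automatically defined function. It is an illustration only, not the evolved program from the paper: the residue set `HYDROPHOBIC`, the names `adf0`, `exact`, `MOTIF`, `matches_at` and `detect`, and the six-position motif itself are all assumptions chosen for the example.

```python
from typing import Callable, List

# Hypothetical residue set, chosen for illustration; not taken from the paper.
HYDROPHOBIC = frozenset("LIVMF")


def adf0(residue: str) -> bool:
    """Reusable subexpression: True if the residue belongs to the set."""
    return residue in HYDROPHOBIC


def exact(letter: str) -> Callable[[str], bool]:
    """Build a test that accepts exactly one amino-acid letter."""
    return lambda residue: residue == letter


# An illustrative motif: one test per position. adf0 is used twice,
# which is the kind of reuse an automatically defined function captures.
MOTIF: List[Callable[[str], bool]] = [
    adf0, adf0, exact("D"), exact("E"), exact("A"), exact("D"),
]


def matches_at(seq: str, start: int, motif=MOTIF) -> bool:
    """Check whether every position test passes for the window at start."""
    if start < 0 or start + len(motif) > len(seq):
        return False
    return all(test(seq[start + i]) for i, test in enumerate(motif))


def detect(seq: str, motif=MOTIF) -> bool:
    """Return True if the motif matches anywhere in the sequence."""
    return any(
        matches_at(seq, start, motif)
        for start in range(len(seq) - len(motif) + 1)
    )
```

Calling `detect` on a one-letter protein sequence string reports whether any window satisfies all position tests. In genetic programming, both the body of the reusable function and the main expression that calls it would be evolved rather than written by hand, and the motif length would not be fixed in advance as it is here.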
