A Hybrid Evolutionary Approach for the Protein Classification Problem

This paper proposes a hybrid algorithm that combines characteristics of Genetic Programming (GP) and Genetic Algorithms (GAs) to discover motifs in proteins and to predict the proteins' functional classes from the discovered motifs. In this algorithm, individuals are represented as IF-THEN classification rules. The rule antecedent consists of a combination of motifs automatically extracted from protein sequences. The rule consequent is the functional class predicted for any protein whose sequence satisfies the combination of motifs in the antecedent. The system can be used in two ways. First, it can serve as a stand-alone classification system, in which the evolved classification rules are used directly to predict the functional classes of proteins. Second, it can serve purely as an "attribute construction" method, discovering motifs that are then given, as predictor attributes, to another classification algorithm; in this mode, a classical decision tree induction algorithm was used as the classifier. The proposed system was evaluated in both scenarios and compared with another Genetic Algorithm designed specifically for motif discovery, and therefore used only as an attribute construction algorithm. The comparison was performed by mining an enzyme data set extracted from the Protein Data Bank. The best results were obtained when the proposed hybrid GP/GA was used as an attribute construction algorithm and the classification (using the constructed attributes) was performed by the decision tree induction algorithm.
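
The two usage modes described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a motif is a plain substring, that the antecedent is a conjunction (AND) of motifs, and that the first matching rule decides the class. The names `Rule`, `classify` and `motif_attributes`, along with the toy sequences and class label, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """IF all motifs occur in the sequence THEN predict `predicted_class`.

    Assumption: motifs are plain substrings combined by AND; the abstract
    does not specify the motif language or the combination operator.
    """
    motifs: Tuple[str, ...]
    predicted_class: str

    def covers(self, sequence: str) -> bool:
        # The antecedent is satisfied when every motif appears in the sequence.
        return all(motif in sequence for motif in self.motifs)


def classify(rules: Sequence[Rule], sequence: str) -> Optional[str]:
    """Stand-alone mode: return the class of the first rule that covers the sequence."""
    for rule in rules:
        if rule.covers(sequence):
            return rule.predicted_class
    return None  # no rule fired; a real system would need a default class


def motif_attributes(motifs: Sequence[str], sequence: str) -> List[int]:
    """Attribute-construction mode: one binary attribute per discovered motif."""
    return [int(motif in sequence) for motif in motifs]


# Toy data, made up for illustration only.
rules = [Rule(motifs=("GDS", "HG"), predicted_class="hydrolase")]
print(classify(rules, "MKGDSAAHGT"))
print(motif_attributes(["GDS", "HG", "WW"], "MKGDSAAHGT"))
```

In the attribute-construction mode, the binary vectors returned by `motif_attributes` would form the rows of a training table passed to a separate classifier, such as a decision tree inducer.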
