A Comparative Study of Machine Learning and Evolutionary Computation Approaches for Protein Secondary Structure Classification

Proteins are essential to life and have countless biological functions. They are synthesized in the ribosomes of cells, following a template given by the messenger RNA (mRNA). During synthesis, the protein folds into a unique three-dimensional structure, known as its native conformation. This process is called protein folding. The biological function of a protein depends on its three-dimensional conformation, which, in turn, is a function of its primary and secondary structures. Ill-formed proteins can be completely inactive or even harmful to the organism. Several diseases are believed to result from the accumulation of ill-formed proteins, such as Alzheimer’s disease, cystic fibrosis, Huntington’s disease and some types of cancer. Acquiring knowledge about the secondary structure of proteins is therefore important, since such knowledge can lead to medical and biochemical advances and even to the development of new drugs with specific functionality. A possible way to infer the full structure of an unknown protein is to identify potential secondary structures in it. However, the rules that govern the formation of secondary structure patterns in proteins are still not precisely known. This paper applies Machine Learning and Evolutionary Computation methods to define suitable classifiers for predicting the secondary structure of proteins, starting from their primary structure (that is, their linear sequence of amino acids).

The paper is organized as follows. Section 2 introduces some basic concepts and important aspects of molecular biology, computational methods for classification tasks, and the protein classification problem. Sections 3 and 4 review, respectively, the machine learning and evolutionary computation methods used in this work. Section 6 describes the methodology applied to compare the different classification algorithms. Section 7 details the computational experiments and results. Finally, Section 8 discusses the results, presents conclusions, and points out future directions.
