Peptide programs: applying fragment programs to protein classification

Predicting and classifying protein function is a central problem in bioinformatics. Alignment methods are useful but have limitations, which have prompted the development and use of machine learning approaches. However, traditional machine learning approaches cannot exploit sequence data directly; instead, they rely on derived sequence features or kernel functions to obtain a feature space. Because, in theory, all the information necessary to predict a protein's structure and function is contained in its sequence, a methodology that exploits sequence data directly could be advantageous. A novel machine learning methodology for protein classification, inspired by the concept of fragment programs, is presented. This methodology consists of assigning a minimal computer program to each of the 20 amino acids, and then representing a protein as the program that results from sequentially applying the programs of the amino acids that compose its sequence. The basic concepts of the methodology presented (peptide programs) are discussed, and a framework is proposed for their implementation, including instruction set, virtual machine, evaluation procedures and convergence methods. The methodology is tested on the binary classification of 33,500 enzymes into 182 distinct Enzyme Commission (EC) classes. The average Matthews correlation coefficient of the binary classifiers is 0.75 in training and 0.68 in validation. Overall, the results obtained demonstrate the potential of the proposed methodology and its ability to extract knowledge from sequence data using very few computational resources.
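To make the idea of composing per-residue programs concrete, the following is a minimal sketch in Python. It is not the instruction set or virtual machine proposed in the paper: the opcodes (`inc`, `mul`), the register count, the decision threshold and the example programs in `toy_programs` are all assumptions chosen for illustration. The sketch shows how a sequence can be run as one program on a small register machine, and how a binary classifier built this way could be scored with the Matthews correlation coefficient.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def step(regs, instr):
    """Apply one instruction to the register file (hypothetical opcodes)."""
    op, dst, k = instr
    if op == "inc":      # add a constant to a register
        regs[dst] += k
    elif op == "mul":    # scale a register by a constant
        regs[dst] *= k
    else:
        raise ValueError(f"unknown opcode: {op}")


def run_peptide_program(sequence, programs, n_regs=4):
    """Run the residue programs one after another, in sequence order.

    The registers persist across residues, so the result depends on the
    order of the residues and not only on the composition.
    """
    regs = [0.0] * n_regs
    for residue in sequence:
        for instr in programs[residue]:
            step(regs, instr)
    return regs[0]  # assumption: register 0 holds the output score


def classify(sequence, programs, threshold=0.0):
    """Binary decision: positive if the output score exceeds the threshold."""
    return run_peptide_program(sequence, programs) > threshold


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Illustrative programs only: most residues do nothing, three have an effect.
toy_programs = {aa: [] for aa in AMINO_ACIDS}
toy_programs["K"] = [("inc", 0, 1.0)]
toy_programs["D"] = [("inc", 0, -1.0)]
toy_programs["A"] = [("mul", 0, 2.0)]

score_ka = run_peptide_program("KA", toy_programs)  # (0 + 1) * 2
score_ak = run_peptide_program("AK", toy_programs)  # (0 * 2) + 1
```

In this toy setup `"KA"` and `"AK"` have the same composition but produce different scores, which is the property that distinguishes a sequential program representation from composition-based features. In a real system the per-residue programs would be found by a search or optimisation procedure; none is sketched here.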
