Protein Fold Discovery Using Stochastic Logic Programs

This chapter starts with a general introduction to protein folding. We then present a probabilistic method for multi-class classification, in particular multi-class protein fold prediction, using Stochastic Logic Programs (SLPs). Multi-class prediction attempts to assign an observed datum or example to its proper class when testing has yielded more than one prediction for it. We apply an SLP parameter estimation algorithm to a previous study in protein fold prediction, in which logic programs were learned by Inductive Logic Programming (ILP) and a large number of multiple predictions were detected. On the basis of several experiments, we demonstrate that Probabilistic ILP (PILP) approaches (e.g., SLPs) have advantages for solving multi-class (protein fold) prediction problems with the help of learned probabilities. In addition, we show that SLPs outperform ILP combined with a majority class predictor in both predictive accuracy and result interpretability.
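The contrast between the two conflict-resolution strategies mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the chapter's implementation: the rule identifiers, fold names, probability labels, and class sizes are all hypothetical, and summing and normalising the probabilities of the fired rules is a simplification of how an SLP assigns probabilities to derivations.

```python
from collections import defaultdict

# Hypothetical fired rules for one test protein:
# (rule id, predicted fold, learned probability label). All values are made up.
FIRED_RULES = [
    ("r1", "fold_a", 0.10),
    ("r2", "fold_b", 0.30),
    ("r3", "fold_a", 0.05),
]

# Hypothetical training-set class sizes, used only by the majority class baseline.
CLASS_SIZES = {"fold_a": 40, "fold_b": 25}


def resolve_by_probability(fired_rules):
    """Pick the fold whose fired rules carry the most probability mass.

    Assumption: the scores of rules predicting the same fold are summed and
    then normalised over all candidate folds.
    """
    scores = defaultdict(float)
    for _rule_id, fold, prob in fired_rules:
        scores[fold] += prob
    total = sum(scores.values())
    posterior = {fold: score / total for fold, score in scores.items()}
    best = max(sorted(posterior), key=posterior.get)
    return best, posterior


def resolve_by_majority(fired_rules, class_sizes):
    """Baseline: among the conflicting folds, pick the largest training class."""
    candidates = sorted({fold for _rule_id, fold, _prob in fired_rules})
    return max(candidates, key=lambda fold: class_sizes[fold])


if __name__ == "__main__":
    best, posterior = resolve_by_probability(FIRED_RULES)
    print("probabilistic choice:", best, posterior)
    print("majority class choice:", resolve_by_majority(FIRED_RULES, CLASS_SIZES))
```

With these made-up numbers the two strategies disagree: the probabilistic resolver favours `fold_b`, while the majority class baseline favours `fold_a`. The normalised scores also give a ranking over the candidate folds, which is one way learned probabilities can make a prediction easier to interpret.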
