Coordination number prediction using learning classifier systems: performance and interpretability

The prediction of the coordination number (CN) of an amino acid in a protein structure has recently received renewed attention. In a recent paper, Kinjo et al. proposed a real-valued definition of CN and a criterion to map it onto a finite set of classes, so that it can be predicted with classification approaches. The literature reports several kinds of input information used for CN prediction. The aim of this paper is to assess the performance of a state-of-the-art learning method, Learning Classifier Systems (LCS), on this CN definition, at various degrees of precision and with several combinations of input attributes. Moreover, we compare the LCS performance to that of other well-known learning techniques. Our experiments are also intended to determine the minimum set of input information needed to achieve good predictive performance, so as to generate competent yet simple and interpretable classification rules. Thus, the generated predictors (rule sets) are also analyzed for their interpretability.
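
To make the idea of a real-valued CN mapped onto a finite set of classes concrete, the following is a minimal sketch in Python. It is illustrative only: the sigmoid-smoothed neighbour count, the parameter values (`d_c`, `w`, `min_sep`), the equal-frequency cut points, and all function names are assumptions of this sketch and are not taken from the text above.

```python
import math

def smooth_cn(coords, i, d_c=10.0, w=3.0, min_sep=3):
    """Real-valued CN of residue i as a sigmoid-weighted neighbour count.

    coords  : list of (x, y, z) tuples, one representative atom per residue
    d_c, w  : distance cut-off and sigmoid sharpness (assumed values)
    min_sep : residues closer than this along the chain are ignored (assumed)
    """
    total = 0.0
    for j, xyz in enumerate(coords):
        if abs(j - i) < min_sep:
            continue
        r = math.dist(coords[i], xyz)
        # Clamp the exponent so very distant pairs cannot overflow exp().
        x = min(w * (r - d_c), 700.0)
        total += 1.0 / (1.0 + math.exp(x))
    return total

def equal_frequency_cuts(values, n_classes):
    """Cut points that split the observed values into equally populated bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_classes] for k in range(1, n_classes)]

def to_class(value, cuts):
    """Class index = number of cut points that the value reaches or exceeds."""
    return sum(value >= c for c in cuts)
```

In such a setup, the cut points would be computed once from the training set and then reused unchanged for test residues; raising `n_classes` corresponds to predicting CN at a finer degree of precision.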

[1]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[2]  K. Nishikawa,et al.  Predicting absolute contact numbers of native protein structure from amino acid sequence , 2004, Proteins.

[3]  Lashon B. Booker,et al.  Recombination Distributions for Genetic Algorithms , 1992, FOGA.

[4]  Robert M. MacCallum,et al.  Striped sheets and protein contact prediction , 2004, ISMB/ECCB.

[5]  Martin V. Butz,et al.  Speeding-Up Pittsburgh Learning Classifier Systems: Modeling Time and Accuracy , 2004, PPSN.

[6]  Pierre Baldi,et al.  The Principled Design of Large-Scale Recursive Neural Network Architectures: DAG-RNNs and the Protein Structure Prediction Problem , 2003, J. Mach. Learn. Res.

[7]  Chris Sander.  Databases of homology-derived protein structures , 1990 .

[8]  Kenneth A. De Jong,et al.  Using genetic algorithms for concept learning , 1993, Machine Learning.

[9]  Stewart W. Wilson.  Classifier Fitness Based on Accuracy , 1995, Evolutionary Computation.

[10]  金田 重郎,et al.  C4.5: Programs for Machine Learning (book review) , 1995 .

[11]  George Karypis,et al.  Prediction of contact maps using support vector machines , 2003, Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings.

[12]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[13]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[14]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[15]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[16]  Jaume Bacardit Peñarroya.  Pittsburgh genetic-based machine learning in the data mining era: representations, generalization, and run-time , 2004 .

[17]  J. Rissanen,et al.  Modeling By Shortest Data Description , 1978, Autom.

[18]  Jacek Blazewicz,et al.  From HP Lattice Models to Real Proteins: Coordination Number Prediction Using Learning Classifier Systems , 2006, EvoWorkshops.

[19]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[20]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[21]  K. De Jong,et al.  Using Genetic Algorithms for Concept Learning , 2004, Machine Learning.

[22]  Hideo Matsuda,et al.  PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB) , 2001, Nucleic Acids Res.

[23]  Huan Liu,et al.  Discretization: An Enabling Technique , 2002, Data Mining and Knowledge Discovery.

[24]  Christopher Bystroff,et al.  Predicting interresidue contacts using templates and pathways , 2003, Proteins.

[25]  Geoffrey J. Barton,et al.  Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation , 1993, Comput. Appl. Biosci.