Rule clustering and super-rule generation for transmembrane segments prediction

Explaining a decision is important for the acceptance of machine learning technology in bioinformatics applications such as protein structure prediction. In previous work, we combined support vector machines (SVMs) with decision trees to extract rules for understanding transmembrane segment prediction. However, that approach produced about 20,000 rules, too many to interpret. In this paper, we present a novel rule-clustering approach (SVM_DT_C) for super-rule generation. We use K-means clustering to group this large number of rules and generate many new super-rules. The experimental results show that the super-rules produced by SVM_DT_C can be analyzed manually by a researcher, and that these super-rules are not only new but also achieve very high transmembrane prediction accuracy (exceeding 95%) most of the time.
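The clustering step can be illustrated with a minimal sketch. It is not the SVM_DT_C implementation. It assumes each extracted rule has already been encoded as a fixed-length numeric vector, and it takes the centroid of each cluster as a stand-in for a super-rule. The abstract does not specify the rule encoding or how a super-rule is derived from a cluster, so both are assumptions, and the names `kmeans`, `rules`, and `super_rules` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two rule vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(rules: Sequence[Sequence[float]], k: int,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Plain K-means (Lloyd's algorithm) over rule vectors.

    Returns (centroids, labels). Each centroid is treated here as a
    candidate super-rule; that interpretation is an assumption.
    """
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen rules.
    centroids = [list(r) for r in rng.sample(list(rules), k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each rule to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(rule, centroids[c]))
            for rule in rules
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rules, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return centroids, labels


# Toy data: six made-up rule vectors forming two well-separated groups.
rules = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
super_rules, labels = kmeans(rules, k=2)
```

In practice the number of clusters `k` controls how many super-rules are produced, so it trades interpretability against how faithfully the super-rules summarize the original rule set.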
