Improving prediction accuracy for protein structure classification by neural network using feature combination

The classification of protein structures is essential for determining protein function in bioinformatics. At present, a reasonably high prediction accuracy has been achieved in classifying proteins into the four classes in SCOP. However, classifying proteins into fine-grained folding categories remains a challenge, especially when the number of possible folding patterns, such as those defined in SCOP, is large. In our previous work, we proposed a hierarchical learning architecture (HLA), two indirect coding features, and a gate function to differentiate proteins according to their classes and folding patterns. Our prediction accuracy for 27 folding categories was 65.5%, which compares favorably with the 56.5% reported earlier by Ding and Dubchak. The success of protein structure classification depends on two factors: the computational methods used and the features selected. In this paper, we use a combinatorial fusion analysis technique to facilitate feature selection and combination for improving predictive accuracy in protein structure classification. When combinatorial fusion is applied to our previous work, the resulting classification has an overall prediction accuracy of 87.8% for four classes and 70.9% for 27 folding categories. These rates are significantly higher than those of our previous work and demonstrate that combinatorial fusion is a valuable method for protein structure classification.
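
As a rough illustration of what combining features can mean in practice, the following minimal sketch fuses the per-class outputs of two scoring systems (for example, classifiers trained on two different features) by score combination and by rank combination. This is not the procedure used in the paper, whose details the abstract does not give; the function names, the min-max normalization, the equal weighting, and the example scores are all assumptions made for illustration.

```python
# Hypothetical sketch: fuse two per-class score lists by score and by rank.
# All names and numbers here are illustrative assumptions, not the paper's method.

def normalize(scores):
    """Min-max scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ranks(scores):
    """Rank 1 goes to the highest score; ties are broken by class index."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result

def score_combination(a, b):
    """Average the normalized scores of the two systems, class by class."""
    return [(x + y) / 2 for x, y in zip(normalize(a), normalize(b))]

def rank_combination(a, b):
    """Average the ranks of the two systems, class by class."""
    return [(x + y) / 2 for x, y in zip(ranks(a), ranks(b))]

def predict_by_score(a, b):
    """Pick the class index with the highest combined score."""
    combined = score_combination(a, b)
    return max(range(len(combined)), key=lambda i: combined[i])

def predict_by_rank(a, b):
    """Pick the class index with the lowest (best) combined rank."""
    combined = rank_combination(a, b)
    return min(range(len(combined)), key=lambda i: combined[i])

# Made-up scores for four classes from two hypothetical feature-based systems.
system_a = [0.9, 0.8, 0.1, 0.2]
system_b = [0.2, 0.7, 0.1, 0.3]
print(predict_by_score(system_a, system_b))
print(predict_by_rank(system_a, system_b))
```

In this toy example, system A alone would pick class 0, while both fused predictions pick class 1, which the two systems jointly rank highest. Whether score or rank combination works better depends on the systems being combined.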

[1]  Cheng-Yan Kao, et al. Combination methods in microarray analysis, 2004, 7th International Symposium on Parallel Architectures, Algorithms and Networks.

[2]  Hongfang Liu, et al. Identifying significant genes from microarray data, 2004, Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[3]  Cathy H. Wu, et al. Neural networks and genome informatics, 2000.

[4]  Chuen-Der Huang, et al. Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification, 2003, IEEE Transactions on NanoBioscience.

[5]  Chris H. Q. Ding, et al. Multi-class protein fold recognition using support vector machines and neural networks, 2001, Bioinform.

[6]  Stuart M. Brown, et al. Selection and validation of differentially expressed genes in head and neck cancer, 2004, Cellular and Molecular Life Sciences CMLS.

[7]  Paul B. Kantor, et al. Predicting the effectiveness of Naïve data fusion on the basis of system characteristics, 2000.

[8]  Tim J. P. Hubbard, et al. SCOP: a Structural Classification of Proteins database, 1999, Nucleic Acids Res.

[9]  Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[10]  Chuan Yi Tang, et al. Feature selection and combination criteria for improving predictive accuracy in protein structure classification, 2005, Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05).

[11]  K. Chou, et al. Prediction of protein structural classes, 1995, Critical reviews in biochemistry and molecular biology.

[12]  D. Frank Hsu, et al. Consensus Scoring Criteria for Improving Enrichment in Virtual Screening, 2005, J. Chem. Inf. Model.

[13]  Adam Krzyżak, et al. Methods of combining multiple classifiers and their applications to handwriting recognition, 1992, IEEE Trans. Syst. Man Cybern.

[14]  D. Frank Hsu, et al. A Study of Data Fusion in Cayley Graphs G(S_n, P_n), 2004.

[15]  B. Rost, et al. Prediction of protein secondary structure at better than 70% accuracy, 1993, Journal of molecular biology.

[16]  D. Frank Hsu, et al. Comparing Rank and Score Combination Methods for Data Fusion in Information Retrieval, 2005, Information Retrieval.

[17]  Hui-Huang Hsu, et al. Advanced Data Mining Technologies in Bioinformatics, 2006.