Application of neural networks for protein sequence classification

Protein sequence classification is modelled as a binary classification problem in which an unlabeled protein sequence is checked to see whether or not it belongs to a known set of protein superfamilies. In this paper we used multilayer perceptrons trained with a supervised learning algorithm to learn the binary classification. The training data consist of two sets: a positive set of sequences belonging to an identified protein superfamily, and a negative set comprising sequences from other superfamilies. When applying neural networks, the first problem to be addressed is feature extraction. In this paper we used the new feature extraction techniques proposed by Wang et al. Simulations reveal that the neural network classifies with good precision when the myosin or photochrome superfamily in our chosen data set is taken as the positive set. The results for the globin superfamily are also good, thus validating the feature extraction methodology and the application of neural networks to protein sequence classification suggested by Wang et al. For the Actin and Ribonuclease superfamilies, however, the network performed poorly. One possible reason is that the choice of sequences in the negative data set is not optimal. We conclude from this work that classification performance depends on a proper selection of sequences for the positive and negative data sets.
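The setup described above (extract a fixed-length feature vector from each sequence, then train a multilayer perceptron on a positive and a negative set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the amino acid composition features are a simple stand-in and are not the feature extraction techniques of Wang et al., the sequences are synthetic toy strings rather than real superfamily members, and the names `composition_features`, `TinyMLP` and `train` are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(seq):
    """Stand-in features (NOT Wang et al.'s): fraction of each residue in seq."""
    seq = seq.upper()
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TinyMLP:
    """One tanh hidden layer and one sigmoid output unit (binary classifier)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hj for w, hj in zip(self.w2, h)) + self.b2)
        return h, y

    def train_step(self, x, target, lr):
        """One stochastic gradient step on the cross-entropy loss."""
        h, y = self.forward(x)
        delta_out = y - target  # dLoss/d(output pre-activation)
        for j, hj in enumerate(h):
            # Backpropagate through tanh before w2[j] is updated.
            delta_h = delta_out * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * delta_out * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_h * xi
            self.b1[j] -= lr * delta_h
        self.b2 -= lr * delta_out

    def predict(self, x):
        """Estimated probability that x belongs to the positive set."""
        return self.forward(x)[1]


def train(model, data, epochs=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, target in data:
            model.train_step(x, target, lr)


# Synthetic toy data (assumption): the "positive" set is drawn from one
# residue alphabet and the "negative" set from another.
_rng = random.Random(42)
positives = ["".join(_rng.choice("AKLE") for _ in range(30)) for _ in range(10)]
negatives = ["".join(_rng.choice("GSPT") for _ in range(30)) for _ in range(10)]

dataset = [(composition_features(s), 1.0) for s in positives]
dataset += [(composition_features(s), 0.0) for s in negatives]

model = TinyMLP(n_in=len(AMINO_ACIDS), n_hidden=8)
train(model, dataset)

# Score two unseen toy sequences, one resembling each set.
p_pos = model.predict(composition_features("AKLEAKLELLKA"))
p_neg = model.predict(composition_features("GSPTGSTTPSGG"))
```

In this toy setting the two sets are trivially separable, so the sketch says nothing about the difficulty of real superfamilies. It does show where the choice of negative sequences enters: the network only learns to separate the positive set from whatever negatives it is given.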