Feature Encoding Technique For Efficient Classification Of Protein Sequences

Bioinformatics has emerged as an area of research that combines computer science and biology for the automatic analysis and modeling of biological data. The volume of data generated by next-generation sequencing projects is growing enormously. These data consist of DNA, RNA, and protein sequences, which contain important information about genes and their structure and function. Computational techniques involving machine learning and pattern recognition are becoming useful in biological data mining. Classifying protein sequences into families or superfamilies based on the primary sequence is a complex and open problem. Among the many problems in protein superfamily classification, three major issues are the selection of a suitable feature encoding method, the extraction of an optimized subset of features with high discriminatory information for representing a protein sequence, and the adoption of an appropriate classification technique that classifies sequences with the highest accuracy. In this paper, we propose a distance-based feature encoding technique for feature extraction. The performance of the proposed technique is validated with different classifiers, which show better results than previously available techniques. The average classification accuracy achieved is 91.2% on the benchmark dataset downloaded from the UniProtKB database.

Keywords: Feature encoding; Data Mining; Feature selection; Superfamily; Protein Classification Algorithm