Bioinformatics has recently emerged as a research area that combines computer science and biology for the automatic analysis and modeling of biological data. The volume of data generated by next-generation sequencing projects is growing enormously. The data consist of DNA, RNA and protein sequences, which contain extremely important information about genes and their structure and function. Computational techniques involving machine learning and pattern recognition are becoming useful in biological data mining. Classifying protein sequences into families or superfamilies based on the primary sequence is a very complex and open problem. Although protein superfamily classification poses many problems, the three major issues are the selection of a suitable feature encoding method, the extraction of an optimized subset of features with higher discriminatory information for representing the protein sequence, and the adoption of an appropriate classification technique that classifies sequences with the highest accuracy. In this paper, we propose a distance-based feature encoding technique for feature extraction. The performance of the proposed technique is validated with different classifiers, which show better results than previously available techniques. The average classification accuracy achieved is 91.2% on the benchmark dataset downloaded from the UniProtKB database.

Keywords: Feature encoding; Data Mining; Feature selection; Superfamily; Protein Classification Algorithm