Machine Learning Based Approach for Protein Function Prediction Using Sequence Derived Properties

Protein function prediction is an important and challenging problem in bioinformatics. Various machine learning based approaches have been proposed to predict protein function from sequence-derived properties. In this paper, 857 sequence-derived features, including amino acid composition, dipeptide composition, correlation, composition, transition and distribution (CTD), and pseudo amino acid composition, are used with several machine learning classifiers, namely Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and fuzzy k-Nearest Neighbor (fuzzy k-NN), to predict protein function. Several feature selection techniques (Correlation Feature Selection, Gain Ratio, Information Gain, OneR attribute evaluation, and ReliefF) are used to select an optimal subset of features. The performance of each classifier is then evaluated on the feature subsets obtained by each feature selection technique. The comparative analysis shows that the Random Forest based method combined with ReliefF gives an overall accuracy of 89.20% and a Matthews correlation coefficient (MCC) of 0.87, which is better than the other classifier and feature selection combinations.
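As an illustration of what sequence-derived features look like, the following minimal Python sketch computes two of the feature groups named above: amino acid composition (20 values) and dipeptide composition (400 values). The function names, the residue ordering, and the handling of non-standard residues are assumptions made for this example; the abstract does not specify how the authors computed these features.

```python
from itertools import product

# The 20 standard amino acids; this ordering is an assumption of the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return 20 values: the fraction of each standard residue in seq.

    Non-standard residues count toward the sequence length but get no
    feature of their own (an assumption of this sketch).
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: the fraction of each adjacent residue pair in seq."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = len(seq) - 1  # number of overlapping adjacent pairs
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence, for illustration only
    features = amino_acid_composition(example) + dipeptide_composition(example)
    print(len(features))  # 20 + 400 = 420 features
```

Concatenating these two vectors gives 420 of the features; the remaining feature groups (correlation, CTD, and pseudo amino acid composition) would be computed separately and appended in the same way.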
