Predicting the Subcellular Localization of Proteins with Multiple Sites Based on Multiple Features Fusion

Protein subcellular localization prediction has attracted much attention in recent years because of its importance for protein function studies and targeted drug discovery, which has made it an important research field in bioinformatics. Traditional experimental methods for determining protein subcellular locations are costly and time-consuming. Over the last two decades, machine learning methods have developed rapidly, and a large number of machine-learning-based predictors of protein subcellular location have been built. However, most of these predictors can assign a protein to only one subcellular location. With advances in biological techniques, more and more proteins with two or more subcellular locations have been found. Studying such proteins is especially important because they have useful implications for both basic biology and bioinformatics research. To improve prediction accuracy, more feature information representing the protein sequence should be extracted. In this paper, several feature extraction methods were fused to extract this feature information, and the multi-label k-nearest neighbors (ML-KNN) algorithm was then used to predict protein subcellular locations. The best overall accuracies we obtained were 66.7304 percent for dataset s1, used in constructing Gpos-mPLoc, and 59.9206 percent for dataset s2, used in constructing Virus-mPLoc.
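The pipeline summarized in the abstract (fuse several sequence features, then classify with ML-KNN) can be illustrated with a minimal sketch. The abstract does not name the individual feature extraction methods, so the two used here (amino acid composition and dipeptide composition, fused by simple concatenation) are assumptions chosen only for illustration. The class `MLKNN`, the helper names, the value of `k`, and the toy sequences are likewise hypothetical. The classifier follows the standard ML-KNN formulation (smoothed label priors combined with neighbor-count likelihoods under a MAP rule) and is not the authors' implementation.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def amino_acid_composition(seq):
    """20-dim vector: frequency of each amino acid (assumed feature)."""
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """400-dim vector: frequency of each adjacent residue pair (assumed feature)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = max(len(pairs), 1)
    return [pairs.count(d) / total for d in DIPEPTIDES]

def fused_features(seq):
    """Fusion by concatenation (an assumption; the paper may fuse differently)."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

class MLKNN:
    """Minimal ML-KNN: one Bayesian decision per label from neighbor label counts."""

    def __init__(self, k=3, smooth=1.0):
        self.k = k
        self.smooth = smooth

    def _neighbors(self, x, exclude=None):
        # Indices of the k training points closest to x (optionally skipping one).
        idx = [i for i in range(len(self.X)) if i != exclude]
        idx.sort(key=lambda i: euclidean(x, self.X[i]))
        return idx[:self.k]

    def fit(self, X, Y):
        self.X, self.Y = X, Y
        m, q, k, s = len(X), len(Y[0]), self.k, self.smooth
        # Smoothed prior probability that an instance carries each label.
        self.prior = [(s + sum(y[l] for y in Y)) / (2 * s + m) for l in range(q)]
        # with_label[l][c]: training instances that have label l and exactly c
        # neighbors with label l; without_label[l][c]: same for those lacking l.
        self.with_label = [[0] * (k + 1) for _ in range(q)]
        self.without_label = [[0] * (k + 1) for _ in range(q)]
        for i in range(m):
            nbrs = self._neighbors(X[i], exclude=i)
            for l in range(q):
                c = sum(Y[j][l] for j in nbrs)
                if Y[i][l]:
                    self.with_label[l][c] += 1
                else:
                    self.without_label[l][c] += 1
        return self

    def predict(self, x):
        nbrs = self._neighbors(x)
        k, s = self.k, self.smooth
        labels = []
        for l, p1 in enumerate(self.prior):
            c = sum(self.Y[j][l] for j in nbrs)
            like1 = (s + self.with_label[l][c]) / (s * (k + 1) + sum(self.with_label[l]))
            like0 = (s + self.without_label[l][c]) / (s * (k + 1) + sum(self.without_label[l]))
            # MAP rule: assign label l if its posterior beats the alternative.
            labels.append(1 if p1 * like1 > (1 - p1) * like0 else 0)
        return labels

if __name__ == "__main__":
    # Hypothetical toy sequences; each label vector marks two made-up locations.
    train_seqs = ["MKKLLAAAGG", "MKKLLAAGGA", "DDEEWWYYFF", "DEEWWYYFFD"]
    train_labels = [[1, 0], [1, 1], [0, 1], [0, 1]]
    model = MLKNN(k=2).fit([fused_features(s) for s in train_seqs], train_labels)
    print(model.predict(fused_features("MKKLAAAGGG")))
```

Because each label gets its own decision, a query protein can be assigned zero, one, or several locations, which is what distinguishes a multi-label predictor from the single-location predictors discussed above.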
