Feature Selection for Pairwise Scoring Kernels with Applications to Protein Subcellular Localization

In biological sequence classification, it is common to convert variable-length sequences into fixed-length vectors via pairwise sequence comparison. This pairwise approach, however, produces feature vectors whose dimension equals the training set size, which invites the curse of dimensionality. Feature selection methods are therefore needed to weed out irrelevant features and reduce training and recognition time. In this paper, we propose to train an SVM on the full-feature column vectors of a pairwise scoring matrix and to select the relevant features based on the support vectors of that SVM. The idea rests on two observations: pairwise scoring matrices are symmetric, so each training vector corresponds to one feature dimension, and support vectors are the training vectors that matter for classification. We refer to this approach as vector-index-adaptive SVM (VIA-SVM). We compare VIA-SVM with other feature selection schemes, including SVM-RFE, R-SVM, and a filter method based on symmetric divergence (SD), on protein subcellular localization. Results show that VIA-SVM automatically bounds the number of selected features within a small range. We also find that fusing VIA-SVM with SD produces more compact feature subsets without decreasing prediction accuracy, and that while VIA-SVM is superior at large feature-set sizes, the combination of SD and VIA-SVM performs better at small feature-set sizes.
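The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact selection rule, so the code simply keeps the feature indices that coincide with support-vector indices. The toy similarity `shared_kmers`, the small dual coordinate descent solver `train_linear_svm`, the example sequences, and all parameter values are assumptions made for illustration only; a real system would use alignment scores and an established SVM library.

```python
def shared_kmers(a, b, k=2):
    """Toy symmetric similarity (assumption): distinct k-mers shared by a and b."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def score_matrix(seqs):
    """Pairwise scoring matrix; row i is the full-feature vector of sequence i."""
    return [[float(shared_kmers(a, b)) for b in seqs] for a in seqs]


def train_linear_svm(X, y, C=1.0, epochs=200):
    """Dual coordinate descent for a linear hinge-loss SVM; returns the dual alphas."""
    n = len(X)
    Xb = [row + [1.0] for row in X]          # constant feature acts as a bias term
    w = [0.0] * len(Xb[0])
    alpha = [0.0] * n
    q = [sum(v * v for v in row) for row in Xb]
    for _ in range(epochs):
        for i in range(n):
            # Gradient of the dual objective with respect to alpha[i].
            g = y[i] * sum(wj * xj for wj, xj in zip(w, Xb[i])) - 1.0
            new = min(max(alpha[i] - g / q[i], 0.0), C)   # clip to [0, C]
            delta = (new - alpha[i]) * y[i]
            if delta != 0.0:
                w = [wj + delta * xj for wj, xj in zip(w, Xb[i])]
                alpha[i] = new
    return alpha


def select_by_support_vectors(S, y, C=1.0, tol=1e-8):
    """Indices of support vectors, reused as feature indices (S is symmetric)."""
    alpha = train_linear_svm(S, y, C)
    return [i for i, a in enumerate(alpha) if a > tol]


def project(S, selected):
    """Keep only the selected feature columns of every vector."""
    return [[row[j] for j in selected] for row in S]


# Hypothetical toy data: two classes of short sequences.
seqs = ["AAKKLL", "AAKKLV", "AKKLLA", "GGSSTT", "GGSSTP", "GSSTTG"]
labels = [1, 1, 1, -1, -1, -1]

S = score_matrix(seqs)
selected = select_by_support_vectors(S, labels)
reduced = project(S, selected)
print("selected feature indices:", selected)
print("reduced dimension:", len(selected), "of", len(seqs))
```

Because the scoring matrix is symmetric, index `i` names both a training vector and a feature, which is what lets the support-vector indices double as a feature subset. A second SVM would then be trained on `reduced` for the final classifier.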
