Comparison of unsupervised feature selection methods for high-dimensional regression problems in prediction of peptide binding affinity

Identification of a robust set of predictive features is one of the most important steps in the construction of clustering, classification and regression models from many thousands of features. Although there have been various attempts to select predictive feature sets from high-dimensional data sets for classification and clustering, regression problems have received limited attention. As semi-supervised and supervised feature selection methods tend to identify noisy features in addition to discriminative variables, unsupervised feature selection methods (USFSMs) are generally regarded as a less biased approach. Therefore, in this study, four different USFSMs, along with the entire feature set, are considered for the quantitative prediction of peptide binding affinity, one of the most challenging post-genome regression problems owing to its very high dimensionality compared to the extremely small number of samples. As USFSMs are independent of any predictive method, support vector regression was then utilised to assess the quality of prediction. Given three different peptide binding affinity data sets, the results suggest that the regression performance of the USFSMs generally depends on the data set: no particular method yields the best performance across all sets, in contrast to their reported behaviour in classification problems. However, a closer investigation of the results suggests that the spectral regression-based approach yields slightly better performance. To the best of our knowledge, this is the first study that presents a comprehensive comparison of USFSMs in such high-dimensional regression problems, particularly in the biological domain with an application to the prediction of peptide binding affinity, and it provides a number of practical suggestions for future practitioners.
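
As a concrete illustration of the pipeline described above, the sketch below (not the authors' code) ranks features with one representative USFSM, the Laplacian Score, without using the binding-affinity labels, and then assesses the top-ranked subset with support vector regression under leave-one-out cross-validation, as appropriate for very small sample sizes. The data shapes, neighbourhood size and number of selected features are illustrative assumptions only, not values taken from the study.

```python
# Minimal sketch: unsupervised feature ranking (Laplacian Score) followed by
# SVR evaluation. All names and parameter values below are illustrative.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR
from scipy.stats import pearsonr


def laplacian_score(X, n_neighbors=5):
    """Return one Laplacian Score per feature (lower = more locality-preserving)."""
    n = X.shape[0]
    dist = euclidean_distances(X, X)
    t = np.mean(dist) ** 2                    # heat-kernel width (common heuristic)
    W = np.exp(-dist ** 2 / t)
    # Keep only symmetrised k-nearest-neighbour edges; zero out everything else.
    idx = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(n), n_neighbors)
    mask[rows, idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    d = W.sum(axis=1)                         # vertex degrees
    L = np.diag(d) - W                        # graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ d) / d.sum()       # remove the degree-weighted mean
        denom = f_tilde @ (d * f_tilde)
        scores[r] = (f_tilde @ L @ f_tilde) / denom if denom > 0 else np.inf
    return scores


# X: peptide descriptors (samples x features), y: measured binding affinities.
# Random placeholders stand in for a typical "few samples, many features" set.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 5000)), rng.normal(size=60)

scores = laplacian_score(X)
selected = np.argsort(scores)[:50]            # keep the 50 best-ranked features

# Leave-one-out evaluation of SVR on the selected subset.
y_pred = cross_val_predict(SVR(kernel="rbf", C=1.0), X[:, selected], y,
                           cv=LeaveOneOut())
print("LOO Pearson correlation: %.3f" % pearsonr(y, y_pred)[0])
```

Any of the other USFSMs compared in the study (e.g. a spectral regression-based or multi-cluster selector) could be substituted for the ranking step without changing the downstream SVR assessment, which is the point of keeping the selection stage independent of the predictor.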
