A Feature Selection Method Based on SVM and ReliefF and Its Application in the Analysis of HPLC-MS Data

High-performance liquid chromatography-mass spectrometry (HPLC-MS) has proven powerful in metabolomic studies. Because HPLC-MS data are high-dimensional, many multivariate analysis techniques, such as principal component analysis, partial least-squares discriminant analysis, random forest and support vector machine, have been applied to process them. Support vector machine (SVM) [1] is a very popular classification method based on statistical learning theory. In constructing the learning model, it also assigns a weight to each variable. However, HPLC-MS data usually contain hundreds of variables, some of which are unrelated to the problem; these can distort the resulting hyperplane and, in turn, the variable weights. To select the most informative variables from HPLC-MS data, we combine SVM with ReliefF [2] to perform recursive feature elimination (SVM-RFE-ReliefF). In each iteration, both the SVM weights and the ReliefF values are computed, and a proportion of the features ranked low by the two measures is deleted. A metabolomics dataset of liver diseases from a UPLC/Q-TOF MS platform, containing 2428 ion features and 60 samples (30 cirrhosis patients and 30 hepatocellular carcinoma (HCC) patients), was used to demonstrate the performance of our method. To validate the selected features, 30 control samples were also collected. The accuracy of our method in distinguishing HCC from cirrhosis was 98.17%±0.95%, which is better than the 97.5%±1.62% obtained with SVM recursive feature elimination (SVM-RFE). This implies that our method can select more discriminative features than SVM-RFE.
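
The elimination loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration. The linear SVM is fit with a Pegasos-style subgradient method without a bias term. ReliefF is reduced to its two-class form with k nearest hits and misses. The two rankings are fused by summing rank positions, because the abstract does not state the fusion rule. The function names, the drop fraction and the toy data are all hypothetical.

```python
import random

def svm_weights(X, y, lam=0.01, epochs=50, seed=0):
    """|w| of a linear SVM fit by Pegasos-style subgradient descent.
    Assumption: no bias term, roughly centered features, labels in {-1, +1}."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, t = [0.0] * d, 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one shuffled pass over the samples
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrinkage
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return [abs(wj) for wj in w]

def relieff_scores(X, y, k=3):
    """Two-class ReliefF: reward features that differ on the k nearest misses
    and agree on the k nearest hits (range-normalised differences)."""
    n, d = len(X), len(X[0])
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(X[a][j] - X[b][j]) / span[j]

    scores = [0.0] * d
    for i in range(n):
        others = sorted((m for m in range(n) if m != i),
                        key=lambda m: sum(diff(i, m, j) for j in range(d)))
        hits = [m for m in others if y[m] == y[i]][:k]
        misses = [m for m in others if y[m] != y[i]][:k]
        for j in range(d):
            miss_term = sum(diff(i, m, j) for m in misses) / len(misses)
            hit_term = sum(diff(i, m, j) for m in hits) / len(hits)
            scores[j] += (miss_term - hit_term) / n
    return scores

def ranks(values):
    """Rank position of each value, 0 = largest (most important)."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    pos = [0] * len(values)
    for r, j in enumerate(order):
        pos[j] = r
    return pos

def svm_rfe_relieff(X, y, n_keep, drop_frac=0.2):
    """Recursively drop the features ranked lowest by both measures."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        r_svm = ranks(svm_weights(sub, y))
        r_rel = ranks(relieff_scores(sub, y))
        combined = [a + b for a, b in zip(r_svm, r_rel)]  # assumed fusion rule
        n_drop = min(len(active) - n_keep, max(1, int(drop_frac * len(active))))
        keep = sorted(range(len(active)), key=lambda p: combined[p])
        keep = keep[:len(active) - n_drop]
        active = [active[p] for p in sorted(keep)]
    return active

def make_toy_data(n_per_class=10, n_noise=5, seed=1):
    """Hypothetical data: feature 0 separates the classes, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            X.append([2.0 * label + rng.uniform(-0.1, 0.1)]
                     + [rng.uniform(-1.0, 1.0) for _ in range(n_noise)])
            y.append(label)
    return X, y

X, y = make_toy_data()
selected = svm_rfe_relieff(X, y, n_keep=2)
print(selected)  # indices of the two surviving features
```

On real HPLC-MS data the features would normally be scaled before both measures are computed, and the retained feature set would be assessed by cross-validated classification accuracy, as in the comparison with SVM-RFE reported above.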