Feature Selection for Classification with Proteomic Data of Mixed Quality

In this paper we experimentally assess the performance of two state-of-the-art feature selection methods, RFE and RELIEF, when used for classifying proteomic pattern samples of mixed quality. The data are generated by spiking human sera to artificially create differentiable sample groups, and by handling samples at different storage temperatures. We consider two types of classifiers: support vector machines (SVM) and k-nearest neighbour (kNN). Results of leave-one-out cross-validation (LOOCV) experiments indicate that RELIEF selects more stable feature subsets than RFE over the runs, and that the features it selects are mainly spiked ones. However, RFE outperforms RELIEF in terms of average LOOCV accuracy when combined with either SVM or kNN. Perfect LOOCV accuracy is obtained by RFE combined with 1NN. Almost all the samples that are misclassified by the algorithms were stored at high temperature. The results of experiments on these data indicate that, when samples of mixed quality are analyzed computationally, selecting only relevant (spiked) features does not necessarily yield the highest classification accuracy.
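To make the evaluation protocol concrete, the following is a minimal sketch of LOOCV in which feature selection is repeated inside every fold, so that both accuracy and the stability of the selected subsets can be read off the same runs. It is an illustration under stated assumptions, not the implementation used in the paper: it uses a basic two-class RELIEF variant (one nearest hit and one nearest miss per sample, range-normalised L1 distance) followed by 1NN, it does not cover RFE or SVM, and the function names (`relief_weights`, `loocv_relief_1nn`, `selection_frequency`) and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Basic two-class RELIEF: one nearest hit and one nearest miss per sample."""
    n, d = len(X), len(X[0])
    # Per-feature range, so each difference lies in [0, 1]; constant features use 1.0.
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no same-class or no other-class neighbour available
        h = min(hits, key=lambda k: dist(X[i], X[k]))
        m = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward features that differ on the miss and agree on the hit.
            w[j] += (diff(X[i], X[m], j) - diff(X[i], X[h], j)) / n
    return w


def top_k(w, k):
    """Indices of the k largest weights, returned in ascending index order."""
    ranked = sorted(range(len(w)), key=lambda j: w[j], reverse=True)
    return sorted(ranked[:k])


def predict_1nn(X_train, y_train, x, feats):
    """Label of the nearest training sample, using only the selected features."""
    def sq_dist(k):
        return sum((X_train[k][j] - x[j]) ** 2 for j in feats)
    return y_train[min(range(len(X_train)), key=sq_dist)]


def loocv_relief_1nn(X, y, k):
    """LOOCV with feature selection redone on the training part of every fold."""
    correct, subsets = 0, []
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_k(relief_weights(X_tr, y_tr), k)
        subsets.append(tuple(feats))
        correct += predict_1nn(X_tr, y_tr, X[i], feats) == y[i]
    return correct / len(X), subsets


def selection_frequency(subsets, d):
    """Fraction of folds in which each feature was selected (a simple stability view)."""
    return [sum(j in s for s in subsets) / len(subsets) for j in range(d)]


# Hypothetical toy data: feature 0 plays the role of a "spiked" feature,
# features 1 and 2 are unrelated to the class label.
X_toy = [
    [0.0, 0.2, 0.9], [0.1, 0.8, 0.1], [0.2, 0.5, 0.6], [0.1, 0.3, 0.4],
    [1.0, 0.7, 0.2], [0.9, 0.1, 0.8], [0.8, 0.6, 0.5], [1.1, 0.4, 0.3],
]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

acc, subsets = loocv_relief_1nn(X_toy, y_toy, k=1)
print("LOOCV accuracy:", acc)
print("selection frequency per feature:", selection_frequency(subsets, 3))
```

Selecting features inside each fold, rather than once on the full data, keeps the held-out sample from influencing the selection; the per-feature selection frequency is only one of several possible ways to summarise subset stability across runs.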
