Evaluation of Feature Ranking Ensembles for High-Dimensional Biomedical Data: A Case Study

Developing accurate, reliable and easy-to-use diagnostic tests depends on identifying a small set of highly discriminative biomarkers. This task can be cast as feature selection within a pattern recognition context. Medical data are usually of the "wide" type, where the number of features is substantially larger than the number of instances. Given the abundance of feature ranking methods, it is difficult to pick the most suitable one and to settle on a final, consistent feature subset. Ensembles of ranking methods have been recommended for the task, but their stability and accuracy have not been evaluated across different ranking methods. Here we present a case study based on 429 samples of exhaled air from smokers, 83% of whom suffer from COPD (chronic obstructive pulmonary disease). The task is to identify a discriminative subset of the 1929 volatile organic compounds (VOCs) measured through mass spectrometry. Using Pareto analysis, 16 feature ranking ensembles were evaluated with respect to three criteria: classification accuracy, area under the ROC curve and the stability of the ensemble selection. The t-statistic was rated the best among the 16 feature rankers, outperforming the currently favoured SVM ranker.
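To make the idea of a feature ranking ensemble and its Pareto evaluation concrete, the following is a minimal sketch in Python with NumPy. It is not the procedure used in the study: the abstract does not specify how the ensembles were built or aggregated, so the stratified bootstrap, the rank-sum aggregation, the Welch form of the t-statistic, and all function names (`t_statistic_scores`, `ensemble_ranking`, `pareto_front`) are assumptions made purely for illustration.

```python
import numpy as np


def t_statistic_scores(X, y):
    """Absolute two-sample t-statistic per feature (Welch form, an assumption).

    X is (n_samples, n_features); y holds binary labels 0/1.
    """
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)


def ensemble_ranking(X, y, n_rankers=20, seed=0):
    """Aggregate t-statistic rankings over stratified bootstrap samples.

    Assumption: members differ only in their bootstrap sample, and the
    rankings are combined by summing ranks (lower sum = better feature).
    Returns feature indices ordered from best to worst.
    """
    rng = np.random.default_rng(seed)
    class_idx = [np.flatnonzero(y == c) for c in (0, 1)]
    rank_sum = np.zeros(X.shape[1])
    for _ in range(n_rankers):
        # Resample within each class so both classes stay represented.
        idx = np.concatenate(
            [rng.choice(ci, size=len(ci), replace=True) for ci in class_idx]
        )
        scores = t_statistic_scores(X[idx], y[idx])
        # Double argsort converts scores to ranks; rank 0 is the top feature.
        rank_sum += np.argsort(np.argsort(-scores))
    return np.argsort(rank_sum)


def pareto_front(points):
    """Indices of non-dominated rows, where larger is better on every criterion.

    Each row could hold, for example, (accuracy, AUC, stability) for one ranker.
    """
    points = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(points):
        # p is dominated if some row is >= on all criteria and > on at least one.
        dominated = np.any(
            np.all(points >= p, axis=1) & np.any(points > p, axis=1)
        )
        if not dominated:
            front.append(i)
    return front
```

In this sketch, a ranker would be placed on the Pareto front if no other ranker is at least as good on all three criteria and strictly better on one; the top-k entries of `ensemble_ranking` would form the selected feature subset.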
