Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis

One challenge in applying bioinformatics tools to clinical or biological data is the high number of features that may be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel feature selection algorithm based on the Lasso; our hypothesis is that a scoring scheme measuring the "quality" of each feature can yield a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to a theoretical analysis of our feature scoring scheme, we provide empirical evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method selected more meaningful features than those commonly used in the clinic. This case study provides a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code implementing it has been released as FeaLect, a documented R package on CRAN.
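The bootstrap-and-combine idea above can be sketched in a few lines. The following is a minimal, simplified illustration, not the FeaLect implementation: it fits a tiny coordinate-descent Lasso at a single fixed penalty on each bootstrap resample and scores each feature by its selection frequency (the actual algorithm combines full relevance-orderings along the regularization path). The solver, the synthetic data, and the penalty value `lam=5.0` are all assumptions made for the sake of the example.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used in the Lasso coordinate-descent update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Tiny coordinate-descent Lasso solver (no intercept, fixed penalty lam)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)  # residual y - X.beta, maintained incrementally
    z = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual (j's effect added back).
            rho = sum(X[i][j] * r[i] for i in range(n)) + z[j] * beta[j]
            b_new = soft_threshold(rho, lam) / z[j]
            if b_new != beta[j]:
                d = beta[j] - b_new
                for i in range(n):
                    r[i] += d * X[i][j]
                beta[j] = b_new
    return beta

def bootstrap_feature_scores(X, y, lam, n_boot=20, seed=0):
    """Score each feature by how often the Lasso selects it across bootstrap resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if abs(beta[j]) > 1e-6:
                counts[j] += 1
    return [c / n_boot for c in counts]

# Synthetic example: only the first 3 of 10 features drive the response.
rng = random.Random(42)
n, p = 40, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * x[0] + 1.5 * x[1] - 2.0 * x[2] + rng.gauss(0, 0.1) for x in X]
scores = bootstrap_feature_scores(X, y, lam=5.0)
```

On this toy problem, the three truly relevant features receive markedly higher scores than the noise features, which is the stability property the scoring scheme is designed to exploit.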
