Parallel Feature Selection for Regularized Least-Squares

This paper introduces a parallel version of the machine learning based feature selection algorithm known as greedy regularized least-squares (RLS). The aim of such machine learning methods is to develop accurate predictive models on complex datasets. Greedy RLS is an efficient implementation of the greedy forward feature selection procedure using regularized least-squares, capable of efficiently selecting the most predictive features from large datasets. It has previously been shown, through the use of matrix algebra shortcuts, to perform feature selection in only a fraction of the time required by traditional implementations. In this paper, the algorithm is adapted to allow for efficient parallel-based feature selection in order to scale the method to run on modern clusters. To demonstrate its effectiveness in practice, we implemented it on a sample genome-wide association study, as well as a number of other high-dimensional datasets, scaling the method to up to 128 cores.

[1]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[2]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[3]  A. E. Hoerl,et al.  Ridge regression: biased estimation for nonorthogonal problems , 2000 .

[4]  Johan A. K. Suykens,et al.  Advances in learning theory : methods, models and applications , 2003 .

[5]  Peter M Visscher,et al.  Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. , 2009, Human molecular genetics.

[6]  Qianchuan He,et al.  BIOINFORMATICS ORIGINAL PAPER , 2022 .

[7]  P. Lachenbruch An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. , 1967, Biometrics.

[8]  A. E. Hoerl,et al.  Ridge Regression: Applications to Nonorthogonal Problems , 1970 .

[9]  Tapio Salakoski,et al.  Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations , 2012, Algorithms for Molecular Biology.

[10]  Simon C. Potter,et al.  Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A , 2007, Nature.

[11]  Simon C. Potter,et al.  Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls , 2007, Nature.

[12]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[13]  Tapio Salakoski,et al.  Speeding Up Greedy Forward Selection for Regularized Least-Squares , 2010, 2010 Ninth International Conference on Machine Learning and Applications.