Massively-parallel best subset selection for ordinary least-squares regression

Selecting an optimal subset of k out of d features for linear regression models, given n training instances, is often considered intractable for feature spaces with hundreds or thousands of dimensions. We propose an efficient massively-parallel implementation that selects such optimal feature subsets in a brute-force fashion for small k. By exploiting the enormous compute power of modern parallel devices such as graphics processing units, the implementation can handle thousands of input dimensions using only standard commodity hardware. We evaluate the practical runtime on artificial datasets and sketch the applicability of our framework in the context of astronomy.
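To make the brute-force idea concrete, the following is a minimal sequential sketch in Python with NumPy. It is not the authors' GPU implementation: the function name `best_subset`, the added intercept column, and the use of `np.linalg.lstsq` are assumptions made for illustration. It enumerates all C(d, k) feature subsets, fits an ordinary least-squares model on each, and keeps the subset with the smallest residual sum of squares. Each subset is evaluated independently of the others, which is the property a massively-parallel implementation can exploit.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, k):
    """Exhaustively search all k-feature subsets of X (hypothetical helper).

    Returns the index tuple with the smallest residual sum of squares (RSS)
    and that RSS value.
    """
    n, d = X.shape
    best_rss, best_idx = np.inf, None
    # There are C(d, k) candidate subsets; each fit is independent.
    for idx in combinations(range(d), k):
        # Design matrix: the selected columns plus an intercept column
        # (the intercept is an assumption of this sketch).
        A = np.column_stack([X[:, list(idx)], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = float(np.sum((y - A @ coef) ** 2))
        if rss < best_rss:
            best_rss, best_idx = rss, idx
    return best_idx, best_rss
```

The loop body is the unit of work that would be distributed across threads on a parallel device; the sequential version above is only practical for small d and k because the number of subsets grows combinatorially.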
