N$^3$LARS: Minimum Redundancy Maximum Relevance Feature Selection for Large and High-dimensional Data

We propose a feature selection method that finds non-redundant features from large and high-dimensional data in a nonlinear way. Specifically, we propose a nonlinear extension of the non-negative least-angle regression (LARS) called N${}^3$LARS, where the similarity between input and output is measured through a normalized version of the Hilbert-Schmidt Independence Criterion (HSIC). An advantage of N${}^3$LARS is that it can easily be implemented on map-reduce frameworks such as Hadoop and Spark; with the help of distributed computing, a set of features can thus be efficiently selected from large and high-dimensional data. Moreover, N${}^3$LARS is a convex method and can find a globally optimal solution. The effectiveness of the proposed method is first demonstrated through feature selection experiments for classification and regression with small and high-dimensional datasets. Finally, we evaluate our proposed method on a large and high-dimensional biology dataset.
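
To make the abstract's similarity measure concrete, below is a minimal sketch of scoring individual features against an output with a normalized HSIC. It assumes Gaussian kernels for both features and output, and it normalizes by the Frobenius norms of the centered Gram matrices (centered kernel alignment); the paper's exact normalization and the non-negative LARS selection step may differ in detail, so treat this only as an illustration of the relevance measure, not as the authors' implementation.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel for a 1-D variable."""
    d = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d / (2.0 * sigma ** 2))

def centered(K):
    """Center a Gram matrix: K_bar = H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def normalized_hsic(K, L, eps=1e-12):
    """Normalized HSIC between two Gram matrices.

    Here: trace(K_bar L_bar) / (||K_bar||_F ||L_bar||_F), i.e. centered
    kernel alignment; an assumption, the paper may normalize differently.
    """
    Kc, Lc = centered(K), centered(L)
    num = np.sum(Kc * Lc)                 # = trace(Kc @ Lc), both symmetric
    den = np.linalg.norm(Kc) * np.linalg.norm(Lc) + eps
    return num / den

# Hypothetical usage: rank features of X (n x d) by relevance to output y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)   # only feature 0 is relevant

L = gaussian_kernel(y)                              # output Gram matrix
scores = [normalized_hsic(gaussian_kernel(X[:, k]), L) for k in range(X.shape[1])]
print(np.argsort(scores)[::-1])                     # feature 0 should rank first
```

Because each feature's Gram matrix can be computed independently, per-feature scores of this kind are straightforward to distribute across workers, which is the property the abstract appeals to when mentioning Hadoop and Spark.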
