Towards ultrahigh dimensional feature selection for big data

In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and then reformulate it as a convex semi-infinite programming (SIP) problem. To solve this SIP problem, we propose an efficient feature generating paradigm. Unlike traditional gradient-based approaches that optimize over all input features, the proposed paradigm iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. This optimization scheme also enables several efficient caching techniques. The feature generating paradigm is guaranteed to converge globally under mild conditions and achieves lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world data sets with tens of millions of data points and O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency.
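The feature generating paradigm described above can be illustrated with a simplified, self-contained sketch. This is a toy stand-in, not the paper's exact algorithm: each outer iteration scores the inactive features by the magnitude of the loss gradient (a proxy for the "most violated" features), activates a small budget of them, and re-solves an l1-regularized logistic subproblem with FISTA, which plays the role of the primal MKL subproblem solved by the modified accelerated proximal gradient method. All function names and parameter choices below are illustrative assumptions.

```python
import numpy as np

def prox_l1(w, t):
    """Soft-thresholding: proximal operator of t * ||w||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def loss_grad(w, X, y):
    """Mean logistic loss and its gradient, labels y in {-1, +1}."""
    z = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -z))
    s = 0.5 * (1.0 - np.tanh(z / 2.0))          # numerically stable 1/(1+e^z)
    grad = -(X * (y * s)[:, None]).mean(axis=0)
    return loss, grad

def solve_subproblem(X, y, lam, n_iter=300):
    """FISTA (accelerated proximal gradient) on the restricted
    l1-regularized logistic subproblem over the active features."""
    L = (np.linalg.norm(X, 2) ** 2) / (4.0 * X.shape[0])  # Lipschitz constant
    step = 1.0 / max(L, 1e-12)
    w = np.zeros(X.shape[1]); v = w.copy(); t = 1.0
    for _ in range(n_iter):
        _, g = loss_grad(v, X, y)
        w_next = prox_l1(v - step * g, step * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w

def feature_generating_selection(X, y, budget=3, outer_iters=3, lam=0.01):
    """Outer loop: activate the `budget` most-violated (largest-gradient)
    inactive features, then re-solve the subproblem on the active set."""
    n, d = X.shape
    active = np.array([], dtype=int)
    w_active = np.zeros(0)
    for _ in range(outer_iters):
        w_full = np.zeros(d); w_full[active] = w_active
        _, g = loss_grad(w_full, X, y)
        score = np.abs(g); score[active] = -np.inf   # skip active features
        active = np.union1d(active, np.argsort(score)[-budget:])
        w_active = solve_subproblem(X[:, active], y, lam)
    return active, w_active
```

Because each subproblem only involves the (small) active set, the per-iteration cost is independent of the total dimensionality once the gradient scores are computed, which is the property that makes this style of algorithm attractive in the ultrahigh-dimensional regime.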
