Fast Flux Discriminant for Large-Scale Sparse Nonlinear Classification

In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification. FFD combines the efficiency and interpretability of linear models with the accuracy of nonlinear models; it also produces sparse models and naturally handles mixed data types. It works by decomposing kernel density estimation over the entire feature space into estimates on selected low-dimensional subspaces. Since the number of candidate subspaces is large, we propose a submodular optimization framework for subspace selection. The predictions from the selected subspaces are then transformed into new features, on which a linear model is learned. Moreover, because the transformed features naturally call for non-negative weights, training requires only smooth optimization even with L1 regularization. Unlike other nonlinear models such as kernel methods, the FFD model is interpretable, as it assigns importance weights to the original features. Its training and testing are also much faster than those of traditional kernel models. Extensive empirical studies on real-world datasets show that the proposed model achieves state-of-the-art classification accuracy with sparsity, interpretability, and exceptional scalability. Our model can be learned in minutes on datasets with millions of samples, for which most existing nonlinear methods would be prohibitively expensive in space and time.
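
To make the pipeline concrete, the following minimal Python sketch walks through the three stages the abstract describes: per-subspace kernel density features, greedy subspace selection, and a smooth L1-regularized linear model with non-negative weights. The greedy correlation score below stands in for the paper's submodular objective, and every helper name, hyperparameter, and library choice (scikit-learn, SciPy) is an illustrative assumption rather than the authors' implementation.

```python
# A minimal, illustrative sketch of the three-stage FFD pipeline described
# above. The greedy correlation score, bandwidth, and all helper names are
# assumptions for illustration, not the authors' reference implementation.
import numpy as np
from itertools import combinations
from scipy.optimize import minimize
from scipy.special import expit
from sklearn.neighbors import KernelDensity

def subspace_feature(X, y, dims, bandwidth=0.5):
    """Log density ratio log p(x_dims | y=1) - log p(x_dims | y=0),
    estimated by KDE restricted to the subspace `dims` (y in {0, 1})."""
    Xs = X[:, dims]
    kde_pos = KernelDensity(bandwidth=bandwidth).fit(Xs[y == 1])
    kde_neg = KernelDensity(bandwidth=bandwidth).fit(Xs[y == 0])
    return kde_pos.score_samples(Xs) - kde_neg.score_samples(Xs)

def greedy_select(X, y, k=5, max_dim=2):
    """Greedily pick k low-dimensional subspaces; a label-correlation
    score stands in for the paper's submodular objective."""
    candidates = [c for r in range(1, max_dim + 1)
                  for c in combinations(range(X.shape[1]), r)]
    chosen, feats = [], []
    for _ in range(k):
        best, best_gain, best_feat = None, -np.inf, None
        for c in candidates:
            if c in chosen:
                continue
            f = subspace_feature(X, y, list(c))
            gain = abs(np.corrcoef(f, y)[0, 1])  # crude marginal-gain proxy
            if gain > best_gain:
                best, best_gain, best_feat = c, gain, f
        chosen.append(best)
        feats.append(best_feat)
    return chosen, np.column_stack(feats)

def fit_nonneg_l1(F, y, lam=0.1):
    """Logistic loss + L1 penalty on non-negative weights. With w >= 0 the
    penalty equals lam * sum(w), so the objective is smooth and a
    box-constrained quasi-Newton method (L-BFGS-B) applies directly."""
    n, d = F.shape
    t = 2.0 * y - 1.0  # labels mapped to {-1, +1}

    def obj(params):
        w, b = params[:d], params[d]
        z = t * (F @ w + b)
        loss = np.logaddexp(0.0, -z).mean()      # mean log(1 + exp(-z))
        gz = -t * expit(-z) / n                  # d loss / d z_i, times dz/d(Fw+b)
        return loss + lam * w.sum(), np.append(F.T @ gz + lam, gz.sum())

    bounds = [(0.0, None)] * d + [(None, None)]  # non-negative w, free bias
    res = minimize(obj, np.zeros(d + 1), jac=True,
                   method="L-BFGS-B", bounds=bounds)
    return res.x[:d], res.x[d]

# Hypothetical usage: pick 5 subspaces of dimension <= 2, then fit.
# subspaces, F = greedy_select(X_train, y_train, k=5, max_dim=2)
# w, b = fit_nonneg_l1(F, y_train, lam=0.1)
```

The key design point this sketch mirrors is the one made in the abstract: once the weights are constrained to be non-negative, the L1 penalty reduces to the linear term lam * sum(w), so the full objective is differentiable and needs no non-smooth machinery.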
