Locally-Linear Learning Machines (L3M)

We present locally-linear learning machines (L3M) for multi-class classification. We formulate a global convex risk function that jointly learns linear partitions of the feature space and region-specific linear classifiers. L3M offers: (1) discriminative power comparable to kernel SVMs and AdaBoost; (2) tight control of generalization error; (3) low training cost due to online training; and (4) low test-time cost due to local linearity. These properties make L3M well suited to "big-data" applications. The empirical risk associated with space-partitioning classifiers is non-convex because it involves products of indicator functions. We derive tight convex surrogates for this risk by first embedding the empirical risk as an extremal point of an optimization problem and then convexifying the resulting problem. Using the proposed convex formulation, we demonstrate improvements in classification accuracy, training time, and test time relative to common discriminative learning methods on challenging multi-class data sets.
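To make the model class concrete, the sketch below contrasts a single global linear classifier with a locally-linear one on XOR-like toy data. This is not the paper's convex formulation: the linear gate that partitions the space is fixed by hand for illustration (L3M learns the partition and the region classifiers jointly), and the per-region classifiers are simple least-squares fits on ±1 labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR-like toy data: no single linear classifier separates it,
# but two region-specific linear classifiers do.
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = np.sign(X[:, 0] * X[:, 1])          # labels in {-1, +1}

def lstsq_classifier(X, y):
    """Fit a linear classifier by least squares on +/-1 labels."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return w

def predict(w, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xa @ w)

# Global linear baseline: near chance on XOR-like data.
w_global = lstsq_classifier(X, y)
acc_global = np.mean(predict(w_global, X) == y)

# Locally-linear model: a linear gate g(x) = sign(x0) splits the space
# into two regions; each region gets its own linear classifier.
# (Gate fixed here for illustration; L3M learns it jointly with the
# region classifiers via a convex surrogate.)
region = X[:, 0] > 0
preds = np.empty(len(X))
for mask in (region, ~region):
    w_r = lstsq_classifier(X[mask], y[mask])
    preds[mask] = predict(w_r, X[mask])
acc_local = np.mean(preds == y)

print(f"global linear accuracy:  {acc_global:.2f}")
print(f"locally-linear accuracy: {acc_local:.2f}")
```

Within each region the label depends linearly on a single coordinate, so the region-specific classifiers succeed where the global one cannot; this is the intuition behind pairing space partitioning with local linear models.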
