Additive Approximations in High Dimensional Nonparametric Regression via the SALSA

High dimensional nonparametric regression is an inherently difficult problem, with known lower bounds that depend exponentially on dimension. A popular strategy to alleviate this curse of dimensionality has been to use additive models of \emph{first order}, which model the regression function as a sum of independent functions on each dimension. Though useful in controlling the variance of the estimate, such models are often too restrictive in practical settings. Between non-additive models, which often have large variance, and first order additive models, which have large bias, there has been little work exploiting the trade-off in the middle via additive models of intermediate order. In this work, we propose SALSA, which bridges this gap by allowing interactions between variables but controls model capacity by limiting the order of interactions. SALSA minimises the residual sum of squares with squared RKHS norm penalties. Algorithmically, it can be viewed as Kernel Ridge Regression with an additive kernel. When the regression function is additive, the excess risk is only polynomial in dimension. Using the Girard-Newton formulae, we efficiently sum over a combinatorial number of terms in the additive expansion. Via a comparison on $15$ real datasets, we show that our method is competitive against $21$ other alternatives.
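To make the kernel construction concrete: the order-$j$ component of the additive kernel is the sum, over all $\binom{D}{j}$ subsets of $j$ dimensions, of the product of one-dimensional base kernels on those dimensions. Elementwise, this is the elementary symmetric polynomial $e_j(k_1, \ldots, k_D)$ of the base kernel values, which the Girard-Newton identities recover from the power sums $p_s = \sum_i k_i^s$ via $j\, e_j = \sum_{s=1}^{j} (-1)^{s-1} e_{j-s}\, p_s$, avoiding the combinatorial enumeration. The sketch below illustrates this computation together with the ridge regression solve; it assumes RBF base kernels with a shared bandwidth and unit weights on each interaction order, and all function names and hyperparameter values are illustrative rather than taken from the paper.

```python
# Minimal sketch of the SALSA computation: an additive kernel assembled via
# the Girard-Newton identities, plugged into kernel ridge regression.
# Assumptions (ours, not the paper's): RBF base kernels, one shared bandwidth,
# unit weight on every interaction order, a single regularisation parameter.
import numpy as np

def base_kernels(X, Z, bandwidth=1.0):
    """One-dimensional RBF kernels, one per input dimension.
    X: (n, D), Z: (m, D). Returns (D, n, m) with entry [i] = k_i(x, z)."""
    diffs = X[:, None, :] - Z[None, :, :]          # (n, m, D)
    K = np.exp(-0.5 * (diffs / bandwidth) ** 2)    # (n, m, D)
    return np.moveaxis(K, -1, 0)                   # (D, n, m)

def additive_kernel(Kdims, order):
    """Sum of the order-1 .. order-`order` additive kernels. The order-j term
    is the elementary symmetric polynomial e_j of the D base kernels, computed
    elementwise from power sums by the Girard-Newton recursion
        j * e_j = sum_{s=1}^{j} (-1)^{s-1} e_{j-s} p_s,
    costing O(order^2) elementwise ops instead of (D choose j) products."""
    D, n, m = Kdims.shape
    p = [None] + [np.sum(Kdims ** s, axis=0) for s in range(1, order + 1)]
    e = [np.ones((n, m))]                          # e_0 = 1
    for j in range(1, order + 1):
        ej = sum((-1) ** (s - 1) * e[j - s] * p[s] for s in range(1, j + 1)) / j
        e.append(ej)
    return sum(e[1:])                              # orders 1 through `order`

def salsa_fit(X, y, order=2, lam=1e-2):
    """Kernel ridge regression: solve (K + lam * n * I) alpha = y."""
    n = X.shape[0]
    K = additive_kernel(base_kernels(X, X), order)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def salsa_predict(X_train, alpha, X_test, order=2):
    K_test = additive_kernel(base_kernels(X_test, X_train), order)
    return K_test @ alpha
```

With `order=1` this reduces to a first order additive model, and with `order=D` it covers the full expansion over all variable subsets; intermediate orders trade bias against variance as described above.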
