Learning Sparse Additive Models with Interactions in High Dimensions

A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is referred to as a Sparse Additive Model (SPAM) if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$, where $\mathcal{S} \subset [d]$ and $|\mathcal{S}| \ll d$. Assuming the $\phi_l$'s and $\mathcal{S}$ to be unknown, the problem of estimating $f$ from its samples has been studied extensively. In this work, we consider a generalized SPAM that allows for second-order interaction terms. For some $\mathcal{S}_1 \subset [d]$ and $\mathcal{S}_2 \subset \binom{[d]}{2}$, the function $f$ is assumed to be of the form: $$f(\mathbf{x}) = \sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in \mathcal{S}_2}\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}).$$ Assuming $\phi_{p}$, $\phi_{(l,l^{\prime})}$, $\mathcal{S}_1$, and $\mathcal{S}_2$ to be unknown, we provide a randomized algorithm that queries $f$ and exactly recovers $\mathcal{S}_1$ and $\mathcal{S}_2$. This in turn enables us to estimate the underlying $\phi_p$ and $\phi_{(l,l^{\prime})}$. We derive sample complexity bounds for our scheme and extend our analysis to the setting where the queries are corrupted with noise, either stochastic or arbitrary but bounded. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.
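To make the model and query setting concrete, the following Python sketch builds a toy instance of a SPAM with one second-order interaction and exposes it through a noisy query oracle, mirroring the synthetic-data setup. The dimension $d$, the active sets, the component functions, and the noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem instance (all choices below are hypothetical, not
# the paper's): ambient dimension d, active sets S1 and S2, components.
d = 500
S1 = [3, 47]                              # active univariate coordinates, unknown to the learner
S2 = [(10, 200)]                          # active interaction pair, unknown to the learner

phi_uni = {3: np.sin, 47: np.tanh}        # phi_p for p in S1
phi_bi = {(10, 200): lambda u, v: u * v}  # phi_{(l,l')} for (l,l') in S2

def f(x):
    """Evaluate f(x) = sum_p phi_p(x_p) + sum_{(l,l')} phi_{(l,l')}(x_l, x_{l'})."""
    val = sum(phi_uni[p](x[p]) for p in S1)
    val += sum(phi_bi[(l, lp)](x[l], x[lp]) for (l, lp) in S2)
    return val

def noisy_query(x, sigma=0.01):
    """Return f(x) corrupted by additive Gaussian noise (the stochastic setting)."""
    return f(x) + sigma * rng.standard_normal()

# A recovery algorithm would choose query points x and observe noisy_query(x);
# here we just evaluate the oracle at a random point in [-1, 1]^d.
x = rng.uniform(-1.0, 1.0, size=d)
print(noisy_query(x))
```

For the arbitrary-but-bounded noise setting, the Gaussian perturbation above would simply be replaced by any adversarial term with magnitude at most some bound $\epsilon$.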
