Minimax-optimal rates for high-dimensional sparse additive models over kernel classes

Sparse additive models are families of d-variate functions with the additive decomposition f* = ∑_{j ∈ S} f*_j, where S is an unknown subset of cardinality s ≪ d. We consider the case where each component function f*_j lies in a reproducing kernel Hilbert space (RKHS), and analyze an ℓ1 kernel-based method for estimating the unknown function f*. Working within a high-dimensional framework that allows both the dimension d and the sparsity s to increase with n, we derive convergence rates in the L^2(P) and L^2(P_n) norms over the class F_{d,s,H} of sparse additive models with each univariate component f*_j bounded. These rates consist of two terms: a subset selection term of order (s log d)/n, corresponding to the difficulty of finding the unknown s-sized subset, and an estimation error term of order s ν_n^2, where ν_n^2 is the optimal rate for estimating a single univariate function within the RKHS. We complement these achievability results by deriving minimax lower bounds on the L^2(P) error, thereby showing the optimality of our method. We thus obtain minimax-optimal rates for many interesting classes of sparse additive models, including polynomials, splines, and finite-rank kernel classes, as well as Sobolev smoothness classes. Concurrent work by Koltchinskii and Yuan [31] analyzes the same ℓ1 kernel-based estimator and, under an additional global boundedness condition, provides rates in the L^2(P) and L^2(P_n) norms of the same order as those proven here, which hold without global boundedness. We analyze an alternative estimator for globally bounded function classes, and prove that it can achieve strictly faster rates for Sobolev smoothness classes when the sparsity satisfies s = Ω(√n). Consequently, in the high-dimensional setting, the minimax rates under global boundedness conditions are strictly faster, so that the rates proven by Koltchinskii and Yuan [31] are not minimax optimal for growing sparsity.
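In display form (keeping the notation of the abstract), the achievable rate combines the two terms as

    \[
      \|\hat{f} - f^*\|^2 \;\lesssim\; \frac{s \log d}{n} \;+\; s\,\nu_n^2,
    \]

where the first term reflects the difficulty of identifying the unknown subset S, and the second the cost of estimating s univariate functions. For a Sobolev class of smoothness α, for instance, ν_n^2 ≍ n^{-2α/(2α+1)}, so the overall rate becomes (s log d)/n + s·n^{-2α/(2α+1)}.

As a concrete illustration, the following is a minimal sketch of one natural ℓ1 kernel-based estimator of this general type: a convex program that penalizes the sum over coordinates of the empirical and Hilbert norms of the component functions. This is a sketch under stated assumptions, not the paper's exact procedure; the Gaussian kernel choice and the names fit_sparse_additive, lam_n, rho_n, and bandwidth are illustrative.

    import numpy as np
    import cvxpy as cp
    from scipy.linalg import sqrtm

    def fit_sparse_additive(X, y, lam_n, rho_n, bandwidth=1.0):
        """Sketch: sparse additive regression via an l1 kernel penalty.

        By the representer theorem, each component is written as
        f_j(.) = sum_i alpha_j[i] k(X[i, j], .), so the vector of values
        of f_j at the samples is K_j @ alpha_j.
        """
        n, d = X.shape
        alphas = [cp.Variable(n) for _ in range(d)]
        fitted, penalty = 0, 0
        for j in range(d):
            # Gaussian kernel matrix for coordinate j (illustrative kernel choice).
            diff = X[:, [j]] - X[:, [j]].T
            K = np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))
            # K^{1/2}, so that ||K_half @ a||_2 = sqrt(a' K a) = ||f_j||_H.
            K_half = np.real(sqrtm(K))
            fitted = fitted + K @ alphas[j]
            # Empirical norm at the samples: ||f_j||_n = ||K @ a||_2 / sqrt(n).
            penalty = penalty + (lam_n * cp.norm(K @ alphas[j], 2) / np.sqrt(n)
                                 + rho_n * cp.norm(K_half @ alphas[j], 2))
        objective = cp.sum_squares(y - fitted) / (2.0 * n) + penalty
        cp.Problem(cp.Minimize(objective)).solve()
        return [a.value for a in alphas]

Read against the rates above, a natural (though here only heuristic) choice is lam_n of order sqrt(log(d)/n) and rho_n of order ν_n, so that each penalty term tracks one component of the error.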

[1] J. Mercer. Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations, 1909.

[2] N. Aronszajn. Theory of Reproducing Kernels, 1950.

[3] M. Birman et al. Piecewise-polynomial approximations of functions of the classes $W_{p}^{\alpha}$, 1967.

[4] G. Wahba et al. Some results on Tchebycheffian spline functions, 1971.

[5] H. Triebel. Inequalities between eigenvalues, entropy numbers, and related quantities of compact operators in Banach spaces, 1980.

[6] H. Weinert. Reproducing kernel Hilbert spaces: Applications in statistical signal processing, 1982.

[7] C. J. Stone et al. Additive Regression and Other Nonparametric Models, 1985.

[8] K. Alexander et al. Rates of growth and sample moduli for weighted empirical processes indexed by sets, 1987.

[9] S. Saitoh. Theory of Reproducing Kernels and Its Applications, 1988.

[10] B. Carl et al. Entropy, Compactness and the Approximation of Operators, 1990.

[11] G. Wahba. Spline models for observational data, 1990.

[12] T. M. Cover et al. Elements of Information Theory, 2005.

[13] M. Talagrand et al. Probability in Banach Spaces: Isoperimetry and Processes, 1991.

[14] L. Breiman. Better subset regression using the nonnegative garrote, 1995.

[15] A. J. Smola et al. Learning with kernels, 1998.

[16] Y. Yang et al. Information-theoretic determination of minimax rates of convergence, 1999.

[17] P. Massart et al. About the constants in Talagrand's concentration inequalities for empirical processes, 2000.

[18] M. Ledoux. The concentration of measure phenomenon, 2001.

[19] S. R. Jammalamadaka. Empirical Processes in M-Estimation, 2001.

[20] S. Mendelson. Geometric Parameters of Kernel Machines. COLT, 2002.

[21] C. Gu. Smoothing Spline ANOVA Models, 2002.

[22] S. P. Boyd et al. Convex Optimization, 2004.

[23] H. H. Zhang et al. Component selection and smoothing in multivariate nonparametric regression, 2006. arXiv:math/0702659.

[24] M. Yuan. Nonnegative Garrote Component Selection in Functional ANOVA models. AISTATS, 2007.

[25] L. A. Wasserman et al. SpAM: Sparse Additive Models. NIPS, 2007.

[26] F. R. Bach. Consistency of the group Lasso and multiple kernel learning. J. Mach. Learn. Res., 2007.

[27] M. Yuan. Sparse Recovery in Large Ensembles of Kernel Machines. COLT, 2008.

[28] M. J. Wainwright et al. A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. NIPS, 2009.

[29] S. van de Geer et al. High-dimensional additive modeling, 2008. arXiv:0806.4115.

[30] P. Bickel et al. Simultaneous analysis of Lasso and Dantzig selector, 2008. arXiv:0801.1095.

[31] V. Koltchinskii et al. Sparsity in multiple kernel learning, 2010. arXiv:1211.2998.

[32] M. J. Wainwright et al. Minimax Rates of Estimation for High-Dimensional Linear Regression Over $\ell_q$-Balls. IEEE Transactions on Information Theory, 2009.

[33] M. J. Wainwright et al. Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming. J. Mach. Learn. Res., 2010.