Hierarchical Total Variations and Doubly Penalized ANOVA Modeling for Multivariate Nonparametric Regression

Abstract

For multivariate nonparametric regression, functional analysis of variance (ANOVA) modeling aims to capture the relationship between a response and covariates by decomposing the unknown function into components representing main effects, two-way interactions, and so on. Such an approach has been pursued explicitly in smoothing spline ANOVA modeling and implicitly in various greedy methods such as MARS. We develop a new method for functional ANOVA modeling, based on doubly penalized estimation with total-variation and empirical-norm penalties, to achieve sparse selection of component functions and of their basis functions. For this purpose, we formulate a new class of hierarchical total variations, which measure total variation at different levels, including main effects and multi-way interactions, possibly after some order of differentiation. Furthermore, we derive suitable basis functions for multivariate splines such that the hierarchical total variation can be represented as a regular Lasso penalty, and hence we extend a previous backfitting algorithm to handle doubly penalized estimation for ANOVA modeling. We present extensive numerical experiments on simulated and real data to compare our method with existing methods, including MARS, tree boosting, and random forests. The results are very encouraging and demonstrate notable gains from our method in prediction or classification accuracy and in the simplicity of the fitted functions. Supplementary materials for this article are available online.
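As a rough orientation to the setup the abstract describes, the functional ANOVA decomposition and the doubly penalized criterion take the following schematic shape (the notation here is illustrative, not the paper's exact definitions): the regression function is decomposed as

    f(x) = b + \sum_{j} f_j(x_j) + \sum_{j<k} f_{jk}(x_j, x_k) + \cdots,

and the components f_S are estimated jointly by minimizing

    \frac{1}{2n} \sum_{i=1}^{n} \{ y_i - f(x_i) \}^2 + \sum_{S} \bigl( \lambda \, \mathrm{HTV}(f_S) + \rho \, \| f_S \|_n \bigr),

where HTV denotes a hierarchical total variation, \| f_S \|_n = \{ n^{-1} \sum_i f_S(x_i)^2 \}^{1/2} is the empirical norm, and (\lambda, \rho) are tuning parameters. Informally, the empirical-norm penalty can zero out entire components, while the total-variation penalty sparsifies each component's basis expansion.

The computational device mentioned in the abstract, representing a total-variation penalty as a regular Lasso penalty in a suitable basis, can be illustrated in one dimension. The sketch below is only a toy version of that reduction (it is not the paper's backfitting algorithm, and the basis, data, and tuning value are invented for the example): with a piecewise-constant step basis, the total variation of the fitted function equals the l1 norm of the step coefficients, so a TV-penalized fit becomes an ordinary Lasso fit.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Toy data: a piecewise-constant signal plus noise.
    rng = np.random.default_rng(0)
    n = 200
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = np.where(x < 0.3, 0.0, np.where(x < 0.7, 2.0, -1.0)) + rng.normal(0.0, 0.3, n)

    # Step basis: column j is the indicator 1{x >= x_j}, j = 1, ..., n-1.
    # A fit b0 + B @ beta is piecewise constant with jump beta_j at x_j,
    # so its total variation is exactly sum_j |beta_j|.
    B = (x[:, None] >= x[None, 1:]).astype(float)

    # Hence the TV-penalized least-squares fit is a regular Lasso problem.
    model = Lasso(alpha=0.01, fit_intercept=True, max_iter=50_000)
    model.fit(B, y)
    fitted = model.intercept_ + B @ model.coef_
    print("jumps selected:", int(np.sum(model.coef_ != 0.0)))
    print("in-sample RMSE:", float(np.sqrt(np.mean((y - fitted) ** 2))))

The same idea, applied with higher-order differences and tensor-product spline bases, is what allows hierarchical penalties of the kind described above to be handled by standard Lasso machinery inside a backfitting loop.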

References

[1] M. Teboulle et al., A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, SIAM J. Imaging Sci., 2009.

[2] T. Hastie et al., Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, 2010.

[3] S. P. Boyd et al., ℓ1 Trend Filtering, SIAM Rev., 2009.

[4] G. Wahba et al., Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman Memorial Lecture, 1995.

[5] S. Wood, Generalized Additive Models: An Introduction with R, 2006.

[6] J. Friedman, Multivariate adaptive regression splines, 1990.

[7] B. M. Greenwell et al., Generalized Boosted Regression Models (R package gbm version 2.1.8), 2020.

[8] L. Condat, Discrete Total Variation: New Definition and Minimization, SIAM J. Imaging Sci., 2017.

[9] L. Breiman, Random Forests, Machine Learning, 2001.

[10] S. van de Geer et al., Locally adaptive regression splines, 1997.

[11] R. Tibshirani, Generalized Additive Models, 1991.

[12] N. Simon et al., Convex Regression with Interpretable Sharp Partitions, J. Mach. Learn. Res., 2016.

[13] C. J. Stone, The Dimensionality Reduction Principle for Generalized Additive Models, 1986.

[14] B. Lindsay et al., Monotonicity of quadratic-approximation algorithms, 1988.

[15] L. A. Wasserman et al., SpAM: Sparse Additive Models, NIPS, 2007.

[16] A. Petersen et al., Fused Lasso Additive Model, Journal of Computational and Graphical Statistics, 2014.

[17] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, 1996.

[18] L. Rudin et al., Nonlinear total variation based noise removal algorithms, 1992.

[19] J. Friedman, Multivariate adaptive regression splines, 1991.

[20] V. Koltchinskii et al., Sparsity in Multiple Kernel Learning, arXiv:1211.2998, 2010.

[22] R. Tibshirani et al., Additive models with trend filtering, The Annals of Statistics, 2017.

[23] L. Ambrosio et al., Functions of Bounded Variation and Free Discontinuity Problems, 2000.

[24] M. R. Osborne et al., A new approach to variable selection in least squares problems, 2000.

[25] I. Johnstone et al., Ideal spatial adaptation by wavelet shrinkage, 1994.

[26] R. Tibshirani, Adaptive piecewise polynomial estimation via trend filtering, arXiv:1304.2986, 2013.

[27] R. Tibshirani et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer Series in Statistics, 2001.

[28] G. M. James et al., Variable Selection Using Adaptive Nonlinear Interaction Structures in High Dimensions, 2010.

[29] J. Zhu et al., Variable Selection With the Strong Heredity Constraint and Its Oracle Property, 2010.

[30] M. J. Wainwright et al., Minimax-Optimal Rates for Sparse Additive Models over Kernel Classes via Convex Programming, J. Mach. Learn. Res., 2010.

[31] C. Gu, Smoothing Spline ANOVA Models, 2002.

[32] A. Y. Chiang, Generalized Additive Models: An Introduction with R, Technometrics, 2007.

[33] S. van de Geer et al., High-dimensional additive modeling, arXiv:0806.4115, 2008.

[34] L. Carin et al., Sparse multinomial logistic regression: fast algorithms and generalization bounds, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.

[35] S. P. Boyd et al., An Interior-Point Method for Large-Scale ℓ1-Regularized Least Squares, IEEE Journal of Selected Topics in Signal Processing, 2007.

[36] C.-H. Zhang et al., Doubly penalized estimation in additive regression with high-dimensional data, The Annals of Statistics, 2019.

[37] K. Lange et al., Coordinate descent algorithms for lasso penalized regression, arXiv:0803.3876, 2008.

[38] Y. Freund et al., A decision-theoretic generalization of on-line learning and an application to boosting, EuroCOLT, 1997.

[39] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, 2015.

[40] H. H. Zhang et al., Component selection and smoothing in multivariate nonparametric regression, arXiv:math/0702659, 2006.

[41] V. S. Shankar Sriram et al., Hypergraph Based Feature Selection Technique for Medical Diagnosis, Journal of Medical Systems, 2016.