Greedy algorithms for prediction

In many prediction problems, the number of variables used to construct a forecast is often of the same order of magnitude as the sample size, if not larger. We then face the problem of constructing a prediction in the presence of potentially large estimation error. The estimation error is controlled either by selecting a subset of the variables or by combining all the variables in some suitable way. This paper considers greedy algorithms to solve this problem. It is shown that the resulting estimators are consistent under weak conditions. In particular, the derived rates of convergence are either minimax or improve on those given in the literature, allowing for dependence and unbounded regressors. Some versions of the algorithms also provide a fast solution to problems such as the Lasso.
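A classic instance of the greedy approach described above is orthogonal matching pursuit, which selects one regressor per step and refits by least squares on the selected set. The following is a minimal illustrative sketch (not the paper's specific algorithm); the function name `omp` and the noiseless toy data are assumptions for illustration.

```python
import numpy as np

def omp(X, y, n_steps):
    """Orthogonal matching pursuit: at each step, greedily add the
    column most correlated with the current residual, then refit
    least squares on all selected columns. A minimal sketch."""
    n, p = X.shape
    residual = y.copy()
    selected = []
    for _ in range(n_steps):
        # correlation of each column with the current residual
        corr = np.abs(X.T @ residual)
        corr[selected] = -np.inf  # do not reselect a column
        j = int(np.argmax(corr))
        selected.append(j)
        # refit least squares on the selected columns
        beta_s, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ beta_s
    beta = np.zeros(p)
    beta[selected] = beta_s
    return beta, selected
```

In a noiseless sparse setting this recovers the active variables exactly; with noise, the number of steps acts as a regularization parameter, playing a role analogous to the Lasso penalty.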
