Contributions to statistical learning: density estimation, expert aggregation and random forests (Contributions à l'apprentissage statistique : estimation de densité, agrégation d'experts et forêts aléatoires)

Statistical learning provides a framework for prediction problems, in which one seeks to predict unknown quantities from examples.

The first part of this thesis concerns random forest methods, a family of algorithms widely used in practice but whose theoretical study proves delicate. Our main contribution is a precise analysis of a stylized variant, Mondrian forests, for which we establish minimax nonparametric convergence rates as well as an advantage of forests over single trees. We also study an "online" variant of Mondrian forests.

The second part is devoted to expert aggregation, where the goal is to combine several sources of predictions (experts) so as to predict as well as the best of them. We analyze the classical exponentially weighted aggregation algorithm in the stochastic setting, where it exhibits a form of adaptivity to the difficulty of the problem. We also study a variant of the problem with a growing class of experts.

The third part addresses regression and density estimation problems. Our first main contribution is a detailed minimax analysis of linear prediction under random design, as a function of the distribution of the covariates; our upper bounds rest on a control of the lower tail of empirical covariance matrices. Our second main contribution is the introduction of a general procedure for density estimation under logarithmic loss, which achieves optimal excess risk bounds that do not degrade in the misspecified case. In the case of logistic regression, this procedure takes a simple form and attains fast convergence rates that are out of reach for plug-in estimators.
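As a concrete illustration of the exponentially weighted aggregation algorithm analyzed in the second part, below is a minimal Python sketch of the forecaster. The synthetic losses, the constant learning rate eta, and all variable names are illustrative assumptions, not code from the thesis.

    import numpy as np

    def exponential_weights(losses, eta):
        # losses: (T, K) array; losses[t, i] is the loss of expert i at round t.
        # eta: learning rate > 0 (kept constant here for simplicity).
        T, K = losses.shape
        cum = np.zeros(K)                        # cumulative loss of each expert
        weights = np.empty((T, K))
        for t in range(T):
            # Weights proportional to exp(-eta * cumulative loss);
            # subtracting the minimum avoids numerical underflow.
            w = np.exp(-eta * (cum - cum.min()))
            weights[t] = w / w.sum()
            cum += losses[t]
        return weights

    # Illustrative run on synthetic i.i.d. losses in [0, 1].
    rng = np.random.default_rng(0)
    T, K = 1000, 10
    losses = rng.uniform(size=(T, K))
    eta = np.sqrt(8 * np.log(K) / T)             # classical tuning for [0, 1] losses
    weights = exponential_weights(losses, eta)
    regret = float((weights * losses).sum() - losses.sum(axis=0).min())
    print(f"regret of exponential weights vs. best expert: {regret:.2f}")

With losses in [0, 1] and this classical tuning of eta, the worst-case regret of the forecaster is of order sqrt(T log K); the stochastic-regime adaptivity studied in the thesis concerns how variants of this same update can achieve much smaller regret when one expert dominates.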
