Learning Minimax Estimators via Online Learning

We consider the problem of designing minimax estimators for the parameters of a probability distribution. In contrast to classical approaches such as the MLE and minimum distance estimators, we take an algorithmic approach to constructing such estimators. We cast the design of a minimax estimator as the problem of finding a mixed-strategy Nash equilibrium of a zero-sum game. By leveraging recent results in online learning with non-convex losses, we provide a general algorithm for finding a mixed-strategy Nash equilibrium of general non-convex non-concave zero-sum games. Our algorithm requires access to two subroutines: (a) one that outputs a Bayes estimator corresponding to a given prior probability distribution, and (b) one that computes the worst-case risk of any given estimator. Given access to these two subroutines, we show that our algorithm outputs both a minimax estimator and a least favorable prior. To demonstrate the power of this approach, we use it to construct provably minimax estimators for classical problems such as estimation in the finite Gaussian sequence model and linear regression.
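To make the game-theoretic view concrete, the following is a minimal sketch on a toy problem, not the algorithm of the paper. It assumes a single observation X ~ N(theta, 1) with |theta| <= m and squared loss, discretizes both theta and X onto finite grids, and swaps in the Hedge (multiplicative weights) update for the prior player in place of the non-convex online learner the abstract refers to. The function name `solve_bounded_normal_mean`, the grid sizes, and the step size `eta` are all hypothetical choices made for illustration. The two inner helpers play the roles of subroutines (a) and (b), with the worst-case risk taken over the grid only.

```python
import math


def solve_bounded_normal_mean(m=1.0, n_theta=21, n_x=121, rounds=200, eta=1.0):
    """Toy minimax solver for X ~ N(theta, 1), |theta| <= m, squared loss.

    Assumption: theta and X are both restricted to finite grids, so this
    solves a discretized game, not the original continuous problem.
    """
    thetas = [-m + 2.0 * m * i / (n_theta - 1) for i in range(n_theta)]
    half_width = m + 5.0
    xs = [-half_width + 2.0 * half_width * j / (n_x - 1) for j in range(n_x)]

    # lik[i][j] = P(X = xs[j] | theta = thetas[i]) in the discretized model.
    lik = []
    for t in thetas:
        row = [math.exp(-0.5 * (x - t) ** 2) for x in xs]
        total = sum(row)
        lik.append([v / total for v in row])

    def bayes_estimator(prior):
        # Role of subroutine (a): under squared loss the Bayes estimator
        # is the posterior mean, computed here at every grid value of X.
        est = []
        for j in range(n_x):
            post = [prior[i] * lik[i][j] for i in range(n_theta)]
            est.append(sum(p * t for p, t in zip(post, thetas)) / sum(post))
        return est

    def risks(est):
        # Risk of an estimator at every grid value of theta; the maximum
        # of this list plays the role of subroutine (b).
        return [
            sum(lik[i][j] * (est[j] - thetas[i]) ** 2 for j in range(n_x))
            for i in range(n_theta)
        ]

    cum_risk = [0.0] * n_theta
    avg_prior = [0.0] * n_theta
    avg_est = [0.0] * n_x
    for _ in range(rounds):
        # Prior player (maximizer): Hedge weights grow with cumulative risk.
        top = max(cum_risk)
        weights = [math.exp(eta * (c - top)) for c in cum_risk]
        z = sum(weights)
        prior = [w / z for w in weights]

        # Estimator player (minimizer): best response to the current prior.
        est = bayes_estimator(prior)
        cum_risk = [c + r for c, r in zip(cum_risk, risks(est))]

        avg_prior = [a + p / rounds for a, p in zip(avg_prior, prior)]
        avg_est = [a + e / rounds for a, e in zip(avg_est, est)]

    # Upper bound on the game value: worst-case risk of the averaged estimator.
    upper = max(risks(avg_est))
    # Lower bound on the game value: Bayes risk of the averaged prior.
    bayes_risks = risks(bayes_estimator(avg_prior))
    lower = sum(p * r for p, r in zip(avg_prior, bayes_risks))
    return avg_prior, avg_est, lower, upper
```

Calling `solve_bounded_normal_mean()` returns the averaged prior (a candidate least favorable prior on the grid), the averaged estimator (a candidate minimax estimator, valid as a single deterministic rule because squared loss is convex), and a lower and upper bound on the value of the discretized game. The gap between the two bounds indicates how close the iterates are to an equilibrium.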
