Exchangeability Characterizes Optimality of Sequential Normalized Maximum Likelihood and Bayesian Prediction

We study online learning under logarithmic loss with regular parametric models. In this setting, each strategy corresponds to a joint distribution on sequences, and the minimax optimal strategy is the normalized maximum likelihood (NML) strategy. We show that the sequential NML (SNML) strategy predicts minimax optimally (i.e., as NML) if and only if the joint distribution on sequences defined by SNML is exchangeable. The same property characterizes the optimality of a Bayesian prediction strategy; in that case, the optimal prior is Jeffreys prior, for a broad class of parametric models in which the maximum likelihood estimator is asymptotically normal. In general, the optimal prediction strategy, NML, depends on the number $n$ of rounds of the game, but when a Bayesian strategy is optimal, NML becomes independent of $n$; our proof exploits this fact together with the asymptotics of NML. The asymptotic normality of the maximum likelihood estimator is what makes Jeffreys prior necessary.
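For concreteness, the display below is a minimal sketch of the standard definitions behind these claims, assuming the usual notation of the universal prediction literature (not quoted from the paper): $p_\theta$ is a regular parametric model, $\hat\theta(x^n)$ the maximum likelihood estimator for the sequence $x^n = x_1, \ldots, x_n$, and sums become integrals for continuous alphabets.

```latex
% NML: the minimax optimal joint distribution for a known horizon n;
% the denominator is the Shtarkov sum.
\[
  p_{\mathrm{NML}}(x^n)
    \;=\; \frac{p_{\hat\theta(x^n)}(x^n)}
               {\sum_{y^n} p_{\hat\theta(y^n)}(y^n)}
\]

% SNML: renormalize the maximized likelihood one step ahead at each
% round t, so the strategy never needs to know the horizon n.
\[
  p_{\mathrm{SNML}}(x_{t+1} \mid x^t)
    \;=\; \frac{p_{\hat\theta(x^t, x_{t+1})}(x^t, x_{t+1})}
               {\sum_{y} p_{\hat\theta(x^t, y)}(x^t, y)}
\]

% Exchangeability of the joint distribution induced by SNML: for every
% permutation \sigma of \{1, \dots, n\},
\[
  p_{\mathrm{SNML}}(x_1, \ldots, x_n)
    \;=\; p_{\mathrm{SNML}}(x_{\sigma(1)}, \ldots, x_{\sigma(n)}).
\]

% Jeffreys prior, where I(\theta) is the Fisher information matrix:
\[
  \pi_{\mathrm{J}}(\theta) \;\propto\; \sqrt{\det I(\theta)}.
\]
```

Note that the Shtarkov sum in the NML denominator ranges over full length-$n$ sequences, which is why NML is horizon-dependent in general, whereas SNML is horizon-independent by construction.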
