Variational Reference Priors

Posterior distributions are useful for a broad range of machine learning tasks, from model selection to reinforcement learning. Given that modern machine learning models can have millions of parameters, selecting an informative prior is typically infeasible, resulting in the widespread use of priors that avoid strong assumptions. For example, recent work on deep generative models (Kingma & Welling, 2014; Rezende et al., 2014) commonly uses the standard Normal distribution as the prior on the latent space. However, just because a prior is relatively flat does not mean it is uninformative. The Jeffreys prior for the Bernoulli model serves as a well-known counterexample: Jeffreys (1946) showed that the arcsine distribution, despite its peaks near 0 and 1, is the truly objective prior (with respect to Fisher information), not the uniform distribution. This suggests that objective priors such as the Jeffreys prior or the related Reference prior (Bernardo, 2005) are worthy of investigation for high-dimensional, web-scale probabilistic models. The challenge, however, is that these priors are difficult to derive for all but the simplest models.
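
To make the Bernoulli counterexample concrete, the following standard derivation (included here only for illustration) shows why the arcsine distribution arises: the Fisher information of a Bernoulli($\theta$) likelihood is

\[
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\log p(x \mid \theta)\right)^{2}\right] = \frac{1}{\theta(1-\theta)},
\qquad
p_J(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2},
\]

which is the Beta(1/2, 1/2) (arcsine) distribution, placing more mass near $\theta = 0$ and $\theta = 1$ rather than spreading it uniformly.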