A key sticking point of Bayesian analysis is the choice of prior distribution, and there is a vast literature on potential defaults including uniform priors, Jeffreys’ priors, reference priors, maximum entropy priors, and weakly informative priors. These methods, however, often manifest a key conceptual tension in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood. In this paper we resolve this apparent paradox by placing the choice of prior into the context of the entire Bayesian analysis, from inference to prediction to model evaluation.

1. The role of the prior distribution in a Bayesian analysis

Both in theory and in practice, the prior distribution can play many roles in a Bayesian analysis. Perhaps most formally the prior serves to encode information germane to the problem being analyzed, but in practice it often becomes a means of stabilizing inferences in complex, high-dimensional problems. In other settings it is treated as little more than a nuisance, serving simply as a catalyst for the expression of uncertainty via Bayes’ theorem.

These different roles often motivate a distinction between “subjective” and “objective” choices of priors, but we are unconvinced of the relevance of this distinction (Gelman and Hennig, 2017). We prefer to characterize Bayesian priors, and statistical models more generally, based on the information they include rather than the philosophical interpretation of that information. The ultimate significance of this information, and hence the prior itself, depends on exactly how that information manifests in the final analysis. Consequently the influence of the prior can only be judged within the context of the likelihood.
In the present paper we address an apparent paradox: Logically, the prior distribution should come before the data model, but in practice, priors are often chosen with reference to a likelihood function. We resolve this puzzle in two ways, first with a robustness argument, recognizing that our models are only approximate, and in particular the relevance to any given data analysis of particular assumptions in the prior distribution depends on the likelihood; and, second, by considering the different roles that the prior plays in different Bayesian analyses.

We thank Matt Hoffman for helpful comments and the National Science Foundation, Office of Naval Research, Institute for Education Sciences, and Sloan Foundation for partial support of this work. Author affiliations: Department of Statistics and Department of Political Science, Columbia University; Department of Statistical Sciences, University of Toronto; Institute for Social and Economic Research and Policy, Columbia University.

1.1. The practical consequences of a prior can depend on the data

One might say that what makes a prior a prior, rather than simply a probability distribution, is that it is destined to be paired with a likelihood. That is, the Bayesian formalism requires that a prior distribution be updated into a posterior distribution based on new data. The practical utility of a prior distribution within a given analysis then depends critically on how it interacts with the assumed probability model for the data in the context of the actual data that are observed.

Consider, for example, a simple binomial likelihood with n = 75 trials and some prior on the success probability, p. If you observe y = 40 then you can readily compute the posterior and consider issues of prior sensitivity and predictive performance regardless of the choice of prior. But what if you observe y = 75? Then suddenly you need to be very careful with the choice of prior to ensure that your inferences don’t blow up.
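The binomial example can be made concrete with a small sketch. It assumes a conjugate Beta(a, b) prior on p, so that the posterior is Beta(a + y, b + n - y); the Beta family, the three specific priors, and the function names below are illustrative assumptions, not choices made in the text.

```python
import math


def beta_posterior(a, b, y, n):
    """Conjugate update: Beta(a, b) prior with y successes in n binomial trials."""
    return a + y, b + (n - y)


def summarize(a, b, y, n):
    """Return posterior parameters, whether the posterior is proper, and its mean."""
    a_post, b_post = beta_posterior(a, b, y, n)
    # A Beta distribution is a proper density only if both parameters are positive.
    proper = a_post > 0 and b_post > 0
    mean = a_post / (a_post + b_post) if proper else None
    return {"a_post": a_post, "b_post": b_post, "proper": proper, "mean": mean}


def mle_log_odds(y, n):
    """Maximum likelihood estimate of logit(p); infinite at the boundaries."""
    if y == 0:
        return -math.inf
    if y == n:
        return math.inf
    return math.log(y / (n - y))


n = 75
# Hypothetical priors chosen for illustration only.
priors = {
    "improper Beta(0, 0)": (0.0, 0.0),
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Beta(2, 2)": (2.0, 2.0),
}

for y in (40, 75):
    print(f"y = {y}: MLE of log-odds = {mle_log_odds(y, n)}")
    for name, (a, b) in priors.items():
        print(f"  {name}: {summarize(a, b, y, n)}")
```

Under these assumptions, y = 40 gives a proper posterior for all three priors, so the choice among them matters little. With y = 75 the improper Beta(0, 0) prior leaves the second posterior parameter at zero, so the posterior is improper, and the maximum likelihood estimate of the log-odds is infinite. The same prior that was harmless in the first case fails in the second.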
This doesn’t imply that the prior should explicitly depend on the measured data, just that a prior that works well in one scenario might be problematic in another. Consequently, to ensure a robust analysis we have to go beyond the standard Bayesian workflow where the prior distribution is meant to be chosen with no reference to the data and, ideally, the data generating experiment itself.

1.2. Existing methods for setting priors already depend on the likelihood

This tension between the conceptual interpretation of the prior and more practical considerations has largely split the long literature of prior choice into two sides: either you build a fully subjective prior distribution with no knowledge of the likelihood, or you leverage at least some aspects of your likelihood to build your prior.

We refer to the first of these positions as maximalist in that the prior distribution represents, at least ideally, all available information about the problem known before the measurement is considered. The maximalist prior is implicitly backed up by the Bayesian’s willingness to bet on it.

Any prior that isn’t fully informative but has any sort of theoretical or practical benefit leans heavily on some aspect of the likelihood. The classic example of this is building priors from the minimalist position which takes data and a model of the measurement process, and considers a prior as little more than an annoying step required to perform a Bayesian analysis. From this perspective, a natural starting point is a noninformative prior. Although it is impossible to define “noninformative” with any rigor, the general idea is that such a prior affects the information in the likelihood as weakly as possible. In practice the drive for noninformativity leads to the naive use of uniform distributions as the limit of an infinitely diffuse probability distribution.
Related is the idea of the reference prior (Bernardo, 1979) which, again, serves as a placeholder to allow Bayesian inference to go forward with minimal problem-specific assumptions. These assumptions frequently require the statistician to replace knowledge of the likelihood with an asymptotic approximation, with the validity of this asymptotic regime ultimately affecting the practical performance of the prior.

A structural prior encodes mathematical properties such as symmetry that represent underlying features of a model. Examples of structural information include exchangeability in hierarchical models and maximum entropy models in physics, which Jaynes (1982) and others have applied to more general statistical settings. A structural prior is not minimalist as it includes information about the underlying problem which is not driven by the measurement process, but neither is it maximalist as it does not attempt to include all available information about the problem at hand. It also makes the implicit assumption that the structural information is consistent with reasonable data generating processes.

A regularizing prior is designed to yield smoother, more stable inferences than would be obtained from maximum likelihood estimation or Bayesian inference with a flat prior. Exactly how a regularizing prior accomplishes this goal clearly depends on the exact nature of the likelihood itself. Regularization, even if applied in a Bayesian context, is a frequentist goal (Rubin, 1984) in that
[1] Benjamin Shaby, et al. The role of the range parameter for estimation and prediction in geostatistics, 2011, arXiv:1108.1851.
[2] J. Bernardo. Reference Posterior Distributions for Bayesian Inference, 1979.
[3] S. Kanazawa. Beautiful parents have more daughters: a further implication of the generalized Trivers-Willard hypothesis (gTWH), 2007, Journal of Theoretical Biology.
[4] M. G. Pittau, et al. A weakly informative default prior distribution for logistic and other regression models, 2008, arXiv:0901.4011.
[5] P. Gustafson, et al. Conservative prior distributions for variance parameters in hierarchical models, 2006.
[6] A. Gelman, et al. Beyond subjective and objective in statistics, 2015, arXiv:1508.05453.
[7] J. Berger, et al. The Intrinsic Bayes Factor for Model Selection and Prediction, 1996.
[8] David B. Dunson, et al. Bayesian data analysis, third edition, 2013.
[9] Haotian Hang, et al. Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics, 2004.
[10] Adrian E. Raftery, et al. Bayes factors and model uncertainty, 1995.
[11] James G. Scott, et al. On the half-Cauchy prior for a global scale parameter, 2011, arXiv:1104.4937.
[12] E. Jaynes. On the rationale of maximum-entropy methods, 1982, Proceedings of the IEEE.
[13] A. O'Hagan, et al. Fractional Bayes factors for model comparison, 1995.
[14] Nadja Klein, et al. Scale-Dependent Priors for Variance Parameters in Structured Additive Distributional Regression, 2016.
[15] Van Der Vaart, et al. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth, 2009, arXiv:0908.3556.
[16] Roger Woodard, et al. Interpolation of Spatial Data: Some Theory for Kriging, 1999, Technometrics.
[17] Haavard Rue, et al. Constructing Priors that Penalize the Complexity of Gaussian Random Fields, 2015, Journal of the American Statistical Association.
[18] L. Wasserman, et al. The Selection of Prior Distributions by Formal Rules, 1996.
[19] Thiago G. Martins, et al. Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors, 2014, arXiv:1403.4630.
[20] Aki Vehtari, et al. Projection predictive variable selection using Stan+R, 2015.
[21] D. Rubin. Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician, 1984.
[22] Andrew Gelman, et al. Of beauty, sex, and power: Statistical challenges in estimating small effects, 2008.
[23] Andrew Gelman, et al. Bayesian Model-Building By Pure Thought: Some Principles and Examples, 1994.