A Rigorous Investigation of “Evidence” and “Occam Factors” in Bayesian Reasoning

This paper first reviews the reasoning behind the Bayesian "evidence" procedure for setting parameters in the probability distributions involved in inductive inference. It then proves that the evidence procedure is incorrect. More precisely, it proves that the assumptions going into the evidence procedure do not, as claimed, "let the data determine the distributions". Instead, those assumptions simply amount to an implicit replacement of the original distributions, which contain free parameters, with new distributions, none of whose parameters are free. For example, as used by MacKay [1991] in the context of neural nets, the evidence procedure is a means for using the training set to determine the free parameter $\alpha$ in the distribution $P(\{w_i\}) \propto \exp\big(-\alpha \sum_{i=1}^{N} w_i^{2}\big)$, where the $N$ quantities $w_i$ are the $N$ weights in the network. As this paper proves, the assumptions going into MacKay's use of the evidence procedure do not in actuality result in a distribution $P(\{w_i\}) \propto \exp\big(-\alpha \sum_{i=1}^{N} w_i^{2}\big)$ for some $\alpha$, but rather in the parameter-less distribution $P(\{w_i\}) \propto \big(\sum_{i=1}^{N} w_i^{2}\big)^{-(N/2+1)}$. The paper goes on to prove that if one makes the assumption of an "entropic prior" with unknown parameter value, in addition to the assumptions used in the evidence procedure, then the prior is completely fixed, but in a form which cannot be entropic. (This calls into question the self-consistency of the numerous arguments purporting to derive an entropic prior "from first principles".) Finally, the paper investigates the Bayesian first-principles "proof" of Occam's razor involving Occam factors, and proves that this "proof" is flawed.
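
To make the parameter-free form concrete, the following is a minimal derivation sketch, not the paper's own argument: it assumes that $\alpha$ enters through a properly normalized Gaussian prior over the weights and is then integrated out against a flat (improper) hyperprior. The normalization factor $(\alpha/\pi)^{N/2}$ is what supplies the powers of $\alpha$ that produce the $-(N/2+1)$ exponent.

\begin{align*}
P(\{w_i\} \mid \alpha) &= \left(\frac{\alpha}{\pi}\right)^{N/2} \exp\!\Big(-\alpha \sum_{i=1}^{N} w_i^{2}\Big), \\
P(\{w_i\}) &\propto \int_{0}^{\infty} \alpha^{N/2}\, e^{-\alpha s}\, d\alpha
  \qquad \text{with } s \equiv \sum_{i=1}^{N} w_i^{2} \text{ and a flat hyperprior on } \alpha \\
&= \Gamma\!\big(\tfrac{N}{2}+1\big)\, s^{-(N/2+1)}
  \;\propto\; \Big(\sum_{i=1}^{N} w_i^{2}\Big)^{-(N/2+1)}.
\end{align*}

Under these assumptions, whatever value the data might have suggested for $\alpha$ no longer appears: once $\alpha$ is marginalized out, there is no free regularization constant left for the training set to "determine", which is the sense in which the resulting distribution is parameter-less.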

[1]  L. M. M.-T., Theory of Probability, 1929, Nature.

[2]  O. William, Journal of the American Statistical Association V-28, 1932.

[3]  R. Cox, et al., Journal of the Royal Statistical Society B, 1972.

[4]  B. Efron, Computers and the Theory of Statistics: Thinking the Unthinkable, 1979.

[5]  Rodney W. Johnson, et al., Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy, 1980, IEEE Trans. Inf. Theory.

[6]  Leslie G. Valiant, et al., A theory of the learnable, 1984, STOC '84.

[7]  Y. Tikochinsky, et al., Alternative approach to maximum-entropy inference, 1984.

[8]  Y. Tikochinsky, et al., Consistent inference of probabilities for reproducible experiments, 1984.

[9]  David Haussler, et al., Occam's Razor, 1987, Inf. Process. Lett.

[10]  J. Skilling, Maximum Entropy and Bayesian Methods 8 (1988), 1989.

[11]  Stephen F. Gull, et al., Developments in Maximum Entropy Data Analysis, 1989.

[12]  Naftali Tishby, et al., Consistent inference of probabilities in layered networks: predictions and generalizations, 1989, International 1989 Joint Conference on Neural Networks.

[13]  David Haussler, et al., Learnability and the Vapnik-Chervonenkis dimension, 1989, JACM.

[14]  T. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics, 1990.

[15]  David H. Wolpert, et al., The Relationship Between Occam's Razor and Convergent Guessing, 1990, Complex Syst.

[16]  Vijay K. Samalam, et al., Exhaustive Learning, 1990, Neural Computation.

[17]  David Haussler, et al., Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension, 1991, COLT '91.

[18]  Opper, et al., Generalization performance of Bayes optimal classification algorithm for learning a perceptron, 1991, Physical Review Letters.

[19]  Anders Krogh, et al., Introduction to the Theory of Neural Computation, 1994, The Advanced Book Program.

[20]  D. MacKay, et al., A Practical Bayesian Framework for Backprop Networks, 1991.

[21]  Wray L. Buntine, et al., Bayesian Back-Propagation, 1991, Complex Syst.

[22]  David H. Wolpert, et al., On the Connection between In-sample Testing and Generalization Error, 1992, Complex Syst.

[23]  Sompolinsky, et al., Statistical mechanics of learning from examples, 1992, Physical Review A: Atomic, Molecular, and Optical Physics.

[24]  Michael Kearns, et al., Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension, 1992, IJCNN International Joint Conference on Neural Networks.

[25]  David H. Wolpert, et al., Stacked generalization, 1992, Neural Networks.

[26]  I. Johnstone, et al., Maximum Entropy and the Nearly Black Object, 1992.
