Bayesian Learning

Theoretical Background

Bayesian methods have undergone tremendous progress in recent years, due largely to mathematical advances in probability and estimation theory (Chater et al. 2006). These advances have allowed theorists to express and derive predictions from far more sophisticated models than previously possible. These models have generated a good deal of excitement for at least two reasons. First, they offer a new interpretation of the goals of cognitive systems, in terms of inductive probabilistic inference, which has revived attempts at rational explanation of human behavior (Oaksford and Chater 2007). Second, Bayesian models may have the potential to explain some of the most complex aspects of human cognition, such as language acquisition or reasoning under uncertainty, where structured information and incomplete knowledge combine in a way that has defied previous approaches (e.g., Kemp and Tenenbaum 2008).

Constructing a Bayesian model involves two steps. The first step is to specify the set of possibilities for the state of the world, which is referred to as the hypothesis space. Each hypothesis can be thought of as a prediction by the subject about what future sensory information will be encountered. However, the term hypothesis should not be confused with its more traditional usage in psychology, connoting explicit testing of rules or other symbolically represented propositions. In the context of Bayesian modeling, hypotheses need have nothing to do with explicit reasoning, and indeed the Bayesian framework makes no commitment whatsoever on this issue. For example, in Bayesian models of visual processing, hypotheses can correspond to extremely low-level information, such as the presence of elementary visual features (contours, etc.) at various locations in the visual field (Geisler et al. 2001). There is also no commitment regarding where the hypotheses come from. Hypotheses could represent innate biases or knowledge, or they could have been learned previously by the individual. Thus, the framework has no position on nativist–empiricist debates. Furthermore, hypotheses representing very different types of information (e.g., a contour in a particular location, whether or not the image reminds you of your mother, whether the image is symmetrical, whether it spells a particular word, etc.) are all lumped together in a common hypothesis space and treated equally by the model. Thus, there is no distinction between different types of representations or knowledge systems within the brain. In general, a hypothesis is nothing more than a probability distribution. This distribution, referred to as the likelihood function, simply specifies how likely each possible pattern of observations is according to the hypothesis in question.
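One way to state this formally is sketched below; the symbols used here (a hypothesis space \(\mathcal{H}\), a hypothesis \(h\), and an observation \(d\)) are introduced purely for illustration and are not tied to any particular published model:

    \mathcal{H} = \{ h_1, h_2, \ldots \}, \qquad
    P(d \mid h) \ \text{for each } h \in \mathcal{H}, \qquad
    \sum_{d} P(d \mid h) = 1 .

That is, the hypothesis space \(\mathcal{H}\) is a set of candidate hypotheses, and each hypothesis \(h\) is identified with its likelihood function \(P(d \mid h)\), a probability distribution over possible observations \(d\) (the sum is replaced by an integral when observations are continuous).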
The second step in constructing a Bayesian model is to specify how strongly the subject believes in each hypothesis before observing data. This initial belief is expressed as a probability distribution over the hypothesis space, and is referred to as the prior. The prior can be thought of as an initial bias in favor of some hypotheses over others, in that it contributes extra “votes” (as elaborated below) that are independent of any actual data. This decisional bias allows the model’s predictions to be shifted in arbitrary directions regardless of the data. As we discuss below, the prior can be a strong point of the model if it is derived independently, from empirical statistics of real environments. However, more commonly, the prior is chosen ad hoc, providing substantial unconstrained flexibility to models that are advocated as rational and assumption-free.

Together, the hypotheses and the prior fully determine a Bayesian model. The model’s goal is to decide how strongly to believe in each hypothesis after data have been observed. This final belief is again expressed as a probability distribution over the hypothesis space and is referred to as the posterior. The statistical identity known as Bayes’ rule specifies how the prior and the likelihoods combine to determine this posterior.
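The computation prescribed by Bayes’ rule is simple enough to show in a few lines of code. The sketch below is a minimal illustration rather than an implementation of any specific published model; the coin-flip hypotheses, their likelihood values, and the function name posterior are hypothetical choices made here for concreteness. Assuming observations that are independent given the hypothesis, the posterior is proportional to the prior times the likelihood of the data, renormalized to sum to one:

    # Minimal discrete Bayesian update: posterior is proportional to prior times likelihood.
    # The hypotheses, prior values, and likelihood values below are illustrative only.

    def posterior(prior, likelihood, data):
        """Return P(h | data) for each hypothesis h.

        prior:      dict mapping hypothesis -> prior probability
        likelihood: dict mapping hypothesis -> {observation: P(observation | hypothesis)}
        data:       sequence of observations, assumed independent given the hypothesis
        """
        unnormalized = {}
        for h, p_h in prior.items():
            p = p_h
            for d in data:
                p *= likelihood[h].get(d, 0.0)   # multiply in P(d | h)
            unnormalized[h] = p
        evidence = sum(unnormalized.values())    # marginal likelihood P(data)
        return {h: p / evidence for h, p in unnormalized.items()}

    # Two toy hypotheses about a coin: fair versus biased toward heads.
    prior = {"fair": 0.5, "biased": 0.5}
    likelihood = {
        "fair":   {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.8, "T": 0.2},
    }

    print(posterior(prior, likelihood, ["H", "H", "T", "H"]))
    # {'fair': 0.379..., 'biased': 0.620...}: the mostly-heads sequence shifts
    # belief toward the 'biased' hypothesis, because that hypothesis assigns
    # the observed data a higher likelihood.

Changing the prior in this sketch (for example, starting from {"fair": 0.9, "biased": 0.1}) shifts the resulting posterior accordingly, which is the sense in which the prior acts as extra “votes” that are independent of the data.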

[1]  Samuel J. Gershman,et al.  A Tutorial on Bayesian Nonparametric Models , 2011, 1106.2697.

[2]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[3]  Akimichi Takemura,et al.  Tail probabilities of the maxima of multilinear forms and their applications , 2001 .

[4]  Hilbert J. Kappen,et al.  Nonmonotonic Generalization Bias of Gaussian Mixture Models , 2000, Neural Computation.

[5]  L. K. Hansen,et al.  Automatic relevance determination for multi‐way models , 2009 .

[6]  James C. Spall,et al.  Introduction to stochastic search and optimization - estimation, simulation, and control , 2003, Wiley-Interscience series in discrete mathematics and optimization.

[7]  Kurt Hornik,et al.  Learning in linear neural networks: a survey , 1995, IEEE Trans. Neural Networks.

[8]  Shinichi Nakajima,et al.  Global analytic solution of fully-observed variational Bayesian matrix factorization , 2013, J. Mach. Learn. Res..

[9]  H. Robbins A Stochastic Approximation Method , 1951 .

[10]  Kazuho Watanabe,et al.  Stochastic complexities of general mixture models in variational Bayesian learning , 2007, Neural Networks.

[11]  David C. Hoyle,et al.  Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes , 2008 .

[12]  J. W. Silverstein,et al.  Eigenvalues of large sample covariance matrices of spiked population models , 2004, math/0408165.

[13]  B. Efron,et al.  Stein's Estimation Rule and Its Competitors- An Empirical Bayes Approach , 1973 .

[14]  René Vidal,et al.  Low rank subspace clustering (LRSC) , 2014, Pattern Recognit. Lett..

[15]  Katsuyuki Hagiwara On the Problem in Model Selection of Neural Network Regression in Overrealizable Scenario , 2002, Neural Computation.

[16]  David J. C. MacKay,et al.  Bayesian Interpolation , 1992, Neural Computation.

[17]  E. Gassiat,et al.  Testing in locally conic models, and application to mixture models , 1997 .

[18]  George Eastman House,et al.  Sparse Bayesian Learning and the Relevance Vector Machine , 2001 .

[19]  Michael E. Tipping,et al.  Probabilistic Principal Component Analysis , 1999 .

[20]  Tsung-Lin Lee,et al.  HOM4PS-2.0: a software package for solving polynomial systems by the polyhedral homotopy continuation method , 2008, Computing.

[21]  Keisuke Yamazaki Asymptotic accuracy of Bayes estimation for latent variables with redundancy , 2015, Machine Learning.

[22]  Cristian Pasarica,et al.  Testing Homogeneity in Gamma Mixture Models , 2003 .

[23]  Akiko Takeda,et al.  PHoM – a Polyhedral Homotopy Continuation Method for Polynomial Systems , 2004, Computing.

[24]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[25]  Shun-ichi Amari,et al.  Learning Coefficients of Layered Models When the True Distribution Mismatches the Singularities , 2003, Neural Computation.

[26]  Kazuho Watanabe An alternative view of variational Bayes and asymptotic approximations of free energy , 2011, Machine Learning.

[27]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Dan Geiger,et al.  Asymptotic Model Selection for Naive Bayesian Networks , 2002, J. Mach. Learn. Res..

[29]  Charles Kemp,et al.  The discovery of structural form , 2008, Proceedings of the National Academy of Sciences.

[30]  Keisuke Yamazaki,et al.  Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures , 2013, Neural Networks.

[31]  Richard A. Harshman,et al.  Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis, 1970.

[32]  Kenji Fukumizu,et al.  Relation between weight size and degree of over-fitting in neural network regression , 2008, Neural Networks.

[33]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[34]  H. Hotelling Analysis of a complex of statistical variables into principal components. , 1933 .

[35]  R. Tibshirani,et al.  Generalized additive models for medical research , 1986, Statistical methods in medical research.

[36]  J. Tenenbaum,et al.  Learning Overhypotheses with Hierarchical Bayesian Models, 2007, Developmental Science.

[37]  Kazuho Watanabe,et al.  Divergence measures and a general framework for local variational approximation , 2011, Neural Networks.

[38]  Finn V. Jensen,et al.  Bayesian Networks and Decision Graphs , 2001, Statistics for Engineering and Information Science.

[39]  Marvin H. J. Guber Bayesian Spectrum Analysis and Parameter Estimation , 1988 .

[40]  K. Wachter The Strong Limits of Random Matrix Spectra for Sample Matrices of Independent Elements , 1978 .

[41]  Shinichi Nakajima,et al.  Theoretical Analysis of Bayesian Matrix Factorization , 2011, J. Mach. Learn. Res..

[42]  Pernille Hemmer,et al.  Moving beyond qualitative evaluations of Bayesian models of cognition , 2015, Psychonomic bulletin & review.

[43]  Esther Levin,et al.  A statistical approach to learning and generalization in layered neural networks , 1989, Proc. IEEE.

[44]  Shinichi Nakajima,et al.  Condition for perfect dimensionality recovery by variational Bayesian PCA , 2015, J. Mach. Learn. Res..

[45]  T. Han,et al.  Mathematics of information and coding , 2001 .

[46]  Shun-ichi Amari,et al.  Identifiability of hidden Markov information sources and their minimum degrees of freedom , 1992, IEEE Trans. Inf. Theory.

[47]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[48]  Refractor Vision , 2000, The Lancet.

[49]  I. Ibragimov,et al.  On the Composition of Unimodal Distributions , 1956 .

[50]  Shinichi Nakajima,et al.  Variational Bayes Solution of Linear Neural Networks and Its Generalization Performance , 2007, Neural Computation.

[51]  Sumio Watanabe,et al.  Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory , 2010, J. Mach. Learn. Res..

[52]  David R. Karger,et al.  Tackling the Poor Assumptions of Naive Bayes Text Classifiers , 2003, ICML.

[53]  Erkki Oja,et al.  Independent Component Analysis , 2001 .

[54]  Chong Wang,et al.  Stochastic variational inference , 2012, J. Mach. Learn. Res..

[55]  Sumio Watanabe,et al.  Asymptotic analysis of Bayesian generalization error with Newton diagram , 2010, Neural Networks.

[56]  C. B. García,et al.  Determining All Solutions to Certain Systems of Nonlinear Equations , 1979, Math. Oper. Res..

[57]  Geoffrey E. Hinton,et al.  SMEM Algorithm for Mixture Models , 1998, Neural Computation.

[58]  Masa-aki Sato,et al.  Hierarchical Bayesian estimation for MEG inverse problem , 2004, NeuroImage.

[59]  Mark A. Girolami,et al.  A Variational Method for Learning Sparse and Overcomplete Representations , 2001, Neural Computation.

[60]  James Kelly,et al.  AutoClass: A Bayesian Classification System , 1993, ML.

[61]  S. Chib Marginal Likelihood from the Gibbs Output , 1995 .

[62]  Sumio Watanabe,et al.  Algebraic Analysis for Nonidentifiable Learning Machines , 2001, Neural Computation.

[63]  Jitendra Malik,et al.  Normalized Cuts and Image Segmentation , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[64]  Akimichi Takemura,et al.  Weights of $\overline{\chi}^2$ distribution for smooth or piecewise smooth cone alternatives, 1997.

[65]  Kazuho Watanabe,et al.  Upper bound for variational free energy of Bayesian networks , 2008, Machine Learning.

[66]  Sumio Watanabe,et al.  Algebraic geometry and stochastic complexity of hidden Markov models , 2005, Neurocomputing.

[67]  Yi Ma,et al.  Robust principal component analysis? , 2009, JACM.

[68]  Kazuho Watanabe,et al.  Phase Transition of Variational Bayes Learning in Bernoulli Mixture , 2010, Aust. J. Intell. Inf. Process. Syst..

[69]  Tapani Raiko,et al.  Practical Approaches to Principal Component Analysis in the Presence of Missing Values, TKK Reports in Information and Computer Science.

[70]  Florian Steinke,et al.  Bayesian Inference and Optimal Design in the Sparse Linear Model , 2007, AISTATS.

[71]  Sumio Watanabe,et al.  Stochastic complexities of reduced rank regression in Bayesian estimation , 2005, Neural Networks.

[72]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[73]  B. Love,et al.  The myth of computational level theory and the vacuity of rational analysis , 2011, Behavioral and Brain Sciences.

[74]  Sumio Watanabe,et al.  A widely applicable Bayesian information criterion , 2012, J. Mach. Learn. Res..

[75]  J. Justice Maximum entropy and bayesian methods in applied statistics , 1986 .

[76]  G. Reinsel,et al.  Multivariate Reduced-Rank Regression: Theory and Applications , 1998 .

[77]  Shinichi Nakajima,et al.  Variational Bayesian sparse additive matrix factorization , 2013, Machine Learning.

[78]  J. Tenenbaum,et al.  Special issue on "Probabilistic models of cognition", 2006, Trends in Cognitive Sciences.

[79]  Kazuho Watanabe,et al.  Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation , 2006, J. Mach. Learn. Res..

[80]  J. Rissanen Stochastic Complexity and Modeling , 1986 .

[81]  I. Johnstone On the distribution of the largest eigenvalue in principal components analysis , 2001 .

[82]  David J. C. MacKay,et al.  Developments in Probabilistic Modelling with Neural Networks - Ensemble Learning , 1995, SNN Symposium on Neural Networks.

[83]  K. Fukumizu Likelihood ratio of unidentifiable models and multilayer neural networks , 2003 .

[84]  Geoffrey E. Hinton,et al.  Bayesian Learning for Neural Networks , 1995 .