Variational Bayes Solution of Linear Neural Networks and Its Generalization Performance

It is well known that in unidentifiable models, Bayes estimation generalizes much better than maximum likelihood (ML) estimation. However, approximating it accurately with Markov chain Monte Carlo methods is computationally very expensive. As an alternative, a tractable approximation method called the variational Bayes (VB) approach has recently been proposed and has been attracting attention. Its advantage over the expectation-maximization (EM) algorithm, which is often used to realize ML estimation, has been shown experimentally in many applications but has not yet been shown theoretically. In this letter, through analysis of the simplest unidentifiable models, we theoretically show some properties of the VB approach. We first prove that in three-layer linear neural networks, the VB approach is asymptotically equivalent to a positive-part James-Stein type shrinkage estimation. We then theoretically clarify its free energy, generalization error, and training error. Comparing them with those of ML estimation and Bayes estimation, we discuss the advantage of the VB approach. We also show that, unlike in Bayes estimation, the free energy and the generalization error are less simply related to each other, and that in typical cases the VB free energy approximates the Bayes free energy well, whereas the VB generalization error differs significantly from the Bayes generalization error.
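
For readers unfamiliar with the shrinkage estimator named above, the following is a minimal sketch of the textbook positive-part James-Stein estimator for the mean of a multivariate normal distribution. It is not the VB solution derived in this letter; the function name `positive_part_james_stein`, the single-observation setting, and the known noise variance `sigma2` are assumptions made only for illustration.

```python
def positive_part_james_stein(x, sigma2=1.0):
    """Shrink one observation x ~ N(theta, sigma2 * I) toward the origin.

    Textbook positive-part James-Stein estimator (illustrative assumption,
    not the estimator derived in the letter).
    """
    d = len(x)
    if d < 3:
        raise ValueError("James-Stein shrinkage needs dimension d >= 3")
    squared_norm = sum(v * v for v in x)
    if squared_norm == 0.0:
        # Nothing to shrink: the observation is already the origin.
        return [0.0] * d
    # Shrinkage factor 1 - (d - 2) * sigma2 / ||x||^2, clipped at zero
    # so that the estimate never flips the sign of the observation.
    factor = max(0.0, 1.0 - (d - 2) * sigma2 / squared_norm)
    return [factor * v for v in x]


large = positive_part_james_stein([3.0, 4.0, 0.0])  # mildly shrunk
small = positive_part_james_stein([0.1, 0.1, 0.1])  # shrunk to exactly zero
```

In this sketch an observation with a large norm is shrunk only slightly, while one whose squared norm falls below `(d - 2) * sigma2` is set exactly to zero; the clipping at zero is what distinguishes the positive-part variant from the plain James-Stein estimator.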
