Approximate Empirical Bayes for Deep Neural Networks

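Objective:

\[
\min_{W,a}\ \min_{\Omega_r,\Omega_c}\ \frac{1}{2}\,\|\hat{y}(W,a) - y\|_2^2 + \lambda\,\|\Omega_r^{1/2} W \Omega_c^{1/2}\|_F^2 - \lambda\left(d \log\det(\Omega_r) + p \log\det(\Omega_c)\right)
\]
\[
\text{subject to}\quad u I_p \preceq \Omega_r \preceq v I_p, \qquad u I_d \preceq \Omega_c \preceq v I_d
\]

• $\Omega_r := \Sigma_r^{-1}$ and $\Omega_c := \Sigma_c^{-1}$ are the precision matrices corresponding to the row and column covariances $\Sigma_r$ and $\Sigma_c$.

Block Coordinate Ascent Algorithm:

Input: initial value $w_0 := \{a^{(0)}, W^{(0)}\}$, $\Omega_r^{(0)}$ and $\Omega_c^{(0)}$, first-order optimization algorithm $\mathcal{A}$, constants $0 < u \le v$.

for $t = 1, \ldots, \infty$ until convergence do
    Fix $\Omega_r^{(t-1)}$, $\Omega_c^{(t-1)}$; optimize $w^{(t)}$ by backpropagation and algorithm $\mathcal{A}$
    $\Omega_r^{(t)} \leftarrow \mathrm{InvThresholding}(W^{(t)} \Omega_c^{(t-1)} W^{(t)T}, d, u, v)$
    $\Omega_c^{(t)} \leftarrow \mathrm{InvThresholding}(W^{(t)T} \Omega_r^{(t)} W^{(t)}, p, u, v)$
end for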
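The text names InvThresholding without defining it. The sketch below is one plausible reading, not a confirmed implementation: it assumes InvThresholding(Δ, m, u, v) solves min tr(ΩΔ) − m log det(Ω) subject to uI ⪯ Ω ⪯ vI, whose unconstrained optimum is Ω = mΔ⁻¹, so the eigenvalues m/rᵢ of that optimum are clipped to [u, v]. The function names `inv_thresholding` and `update_precisions`, the eigenvalue guard, and the random test weights are all illustrative assumptions.

```python
import numpy as np


def inv_thresholding(delta, m, u, v):
    """Assumed form of InvThresholding: invert the eigenvalues of `delta`
    (scaled by m) and clip them to the interval [u, v]."""
    # Symmetrize to protect against round-off asymmetry before eigh.
    delta = 0.5 * (delta + delta.T)
    r, q = np.linalg.eigh(delta)
    # Guard against zero or slightly negative eigenvalues (rank-deficient delta);
    # these directions end up at the upper bound v after clipping.
    r = np.maximum(r, 1e-12)
    clipped = np.clip(m / r, u, v)
    # Rebuild Q diag(clipped) Q^T.
    return (q * clipped) @ q.T


def update_precisions(W, omega_c, u, v):
    """Two precision-update steps of the loop for a fixed weight matrix W (p x d)."""
    p, d = W.shape
    omega_r = inv_thresholding(W @ omega_c @ W.T, d, u, v)  # p x p
    omega_c = inv_thresholding(W.T @ omega_r @ W, p, u, v)  # d x d
    return omega_r, omega_c


# Hypothetical usage with random weights (p = 4, d = 3).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
omega_r, omega_c = update_precisions(W, np.eye(3), u=0.1, v=10.0)
```

In a full training loop, these two updates would alternate with several gradient steps on the weights, with Ω_r and Ω_c held fixed inside the regularizer; λ does not appear in the updates because it scales both the trace term and the log-determinant term.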
