Relative Fisher Information and Natural Gradient for Learning Large Modular Models

Fisher information and the natural gradient have provided deep insights and powerful tools for artificial neural networks. However, the related analysis becomes increasingly difficult as the learner's structure grows large and complex. This paper takes a preliminary step in a new direction. We extract a local component from a large neural system and define its relative Fisher information metric, which accurately describes this small component and is invariant to the other parts of the system. This concept is important because the geometric structure is greatly simplified, and it can readily be applied to guide the learning of neural networks. We analyze a list of commonly used components and demonstrate how to use this concept to further improve optimization.
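
To make the natural-gradient idea concrete, below is a minimal NumPy sketch of such an update restricted to one component's parameters. The `score` function (returning the per-sample score vector of the local component, with the rest of the system held fixed), the damping constant, and the learning rate are hypothetical ingredients of this illustration, not the paper's implementation.

```python
import numpy as np

def empirical_fisher(score, theta, samples, damping=1e-4):
    """Empirical Fisher matrix of one component, estimated as the
    average outer product of per-sample score vectors."""
    # score(theta, x) is assumed to return d/d(theta) log p(x; theta)
    # for this component only; shape (d,) per sample.
    scores = np.stack([score(theta, x) for x in samples])  # (n, d)
    fisher = scores.T @ scores / len(samples)              # (d, d) estimate of E[s s^T]
    return fisher + damping * np.eye(len(theta))           # damping keeps F invertible

def natural_gradient_step(theta, grad, fisher, lr=0.1):
    """One natural-gradient update: theta <- theta - lr * F^{-1} grad."""
    return theta - lr * np.linalg.solve(fisher, grad)
```

Because only one component's parameters enter the metric, the Fisher matrix stays small and cheap to invert, which reflects the simplification of the geometry that the abstract describes.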
