Learning Structures for Deep Neural Networks

In this paper, we study automatic structure learning for deep neural networks (DNNs), motivated by the observation that the performance of a deep neural network is highly sensitive to its structure and that previous successes of DNNs have relied heavily on human experts to design the network structures. We focus on the unsupervised setting for structure learning and propose to adopt the efficient coding principle, rooted in information theory and developed in computational neuroscience, to guide the procedure of structure learning without label information. This principle suggests that a good network structure should maximize the mutual information between inputs and outputs, or equivalently maximize the entropy of the outputs under mild assumptions. We further establish connections between this principle and the theory of Bayesian optimal classification, and empirically verify that larger entropy of the outputs of a deep neural network indeed corresponds to better classification accuracy. Then, as an implementation of the principle, we show that sparse coding can effectively maximize the entropy of the output signals, and accordingly design an algorithm based on global group sparse coding to automatically learn the inter-layer connections and determine the depth of a neural network. Our experiments on a public image classification dataset demonstrate that, using the structure learned from scratch by our proposed algorithm, one can achieve classification accuracy comparable to the best expert-designed structure (i.e., convolutional neural networks (CNNs)). In addition, our proposed algorithm successfully discovers the local connectivity (corresponding to local receptive fields in CNNs) and invariance structure (corresponding to pooling in CNNs), and achieves a good trade-off between marginal performance gain and network depth. All of this indicates the power of the efficient coding principle and the effectiveness of automatic structure learning.

(Work carried out at MSRA; originally submitted to the International Conference on Machine Learning, 2014, Beijing, China.)
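
The abstract does not spell out the exact "global group sparse coding" objective, so the sketch below only illustrates the generic mechanism it alludes to: a group-lasso penalty on the weights tying a new layer to candidate lower-layer units, solved by proximal gradient descent, with groups driven exactly to zero read off as absent connections. The Python code and all names in it (learn_connections, group_soft_threshold, X, Y, groups, lam) are hypothetical and not taken from the paper.

import numpy as np

def group_soft_threshold(W, groups, t):
    # Block soft-thresholding: shrink each group of rows of W toward zero;
    # a group whose joint norm is at most t is set exactly to zero.
    W = W.copy()
    for g in groups:
        norm = np.linalg.norm(W[g])
        W[g] = 0.0 if norm <= t else (1.0 - t / norm) * W[g]
    return W

def learn_connections(X, Y, groups, lam=0.1, n_iter=500):
    # Proximal-gradient (ISTA) solver for the group-lasso problem
    #   min_W  0.5 * ||Y - X W||_F^2 + lam * sum_g ||W[g]||_F,
    # where each group g collects the rows of W tied to one candidate
    # lower-layer unit. Hypothetical inputs: X holds responses of the
    # candidate units on a batch of data, Y holds target codes for the
    # units of the layer being grown.
    d, k = X.shape[1], Y.shape[1]
    W = np.zeros((d, k))
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y)
        W = group_soft_threshold(W - step * grad, groups, step * lam)
    connected = [g for g in groups if np.linalg.norm(W[g]) > 0]
    return W, connected

In this reading, a layer-growing loop would repeatedly call learn_connections on the current layer's outputs and stop adding layers once the marginal gain (e.g., in output entropy) no longer justifies the extra depth, matching the depth/performance trade-off mentioned above; the paper's actual formulation may differ.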
