Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks

In this paper we propose and investigate a novel nonlinear unit, called the Lp unit, for deep neural networks. The proposed Lp unit receives signals from several projections of a subset of units in the layer below and computes a normalized Lp norm over them. We note two interesting interpretations of the Lp unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators, such as average, root-mean-square and max pooling, widely used in, for instance, convolutional neural networks (CNNs), HMAX models and neocognitrons. Furthermore, the Lp unit is, to a certain degree, similar to the recently proposed maxout unit [38], which achieved state-of-the-art object recognition results on a number of benchmark datasets. Second, we provide a geometrical interpretation of the activation function, based on which we argue that the Lp unit is more efficient at representing complex, nonlinear separating boundaries: each Lp unit defines a superelliptic boundary whose exact shape is determined by its order p. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few Lp units of different orders. This insight justifies learning a different order for each unit in the model. We empirically evaluate the proposed Lp unit on a number of datasets and show that multilayer perceptrons (MLPs) built from Lp units achieve state-of-the-art results on several benchmarks. Furthermore, we evaluate the proposed Lp unit on the recently proposed deep recurrent neural networks (RNNs).
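
To make the pooling operation concrete, the following is a minimal NumPy sketch of a single Lp unit with a learnable order. The softplus reparameterization of the order (p = 1 + softplus(rho)) and the names W, c and rho are illustrative assumptions, not details taken from the abstract; the abstract only specifies that the unit pools several linear projections of its inputs through a normalized Lp norm.

import numpy as np

def lp_unit(x, W, c, rho):
    """One Lp unit: a normalized Lp norm over N linear projections of x.

    W (N x d) and c (N,) are the projection parameters; rho is a free scalar
    from which the order p >= 1 is obtained. All names are illustrative.
    """
    p = 1.0 + np.log1p(np.exp(rho))              # learned order, constrained to p >= 1
    z = W @ x + c                                # N projections of the input
    return np.mean(np.abs(z) ** p) ** (1.0 / p)  # normalized Lp norm of the projections

# Example: one unit pooling N = 4 projections of a 10-dimensional input.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
W = rng.standard_normal((4, 10))
c = np.zeros(4)
print(lp_unit(x, W, c, rho=0.0))

With p = 1 the unit reduces to the average of the absolute projections, p = 2 gives root-mean-square pooling, and as p grows the normalized norm approaches the maximum of the projections, which is where the connection to max pooling and, loosely, to the maxout unit comes from. Learning rho therefore lets each unit pick its own point on this spectrum.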

[1] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, 1962.

[2] D. Hubel et al. Receptive fields and functional architecture of monkey striate cortex, 1968, The Journal of Physiology.

[3] Geoffrey E. Hinton et al. Learning representations by back-propagating errors, 1986, Nature.

[6] Kurt Hornik et al. Multilayer feedforward networks are universal approximators, 1989, Neural Networks.

[7] Yoshua Bengio et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[8] T. Poggio et al. Hierarchical models of object recognition in cortex, 1999, Nature Neuroscience.

[9] Kunihiko Fukushima et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, 1980, Biological Cybernetics.

[11] Marc'Aurelio Ranzato et al. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[12] A. Hyvärinen et al. Complex cell pooling and the statistics of natural images, 2007, Network.

[13] M. Trebar et al. Application of distributed SVM architectures in classifying forest data cover types, 2008.

[14] Yihong Gong et al. Linear spatial pyramid matching using sparse coding for image classification, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15] Yann LeCun et al. What is the best multi-stage architecture for object recognition?, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[16] Jean Ponce et al. A Theoretical Analysis of Feature Pooling in Visual Recognition, 2010, ICML.

[17] Razvan Pascanu et al. Theano: A CPU and GPU Math Compiler in Python, 2010, SciPy.

[18] Geoffrey E. Hinton et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[19] Simon Haykin et al. Neural Networks and Learning Machines, 2010.

[20] Yoshua Bengio et al. Deep Sparse Rectifier Neural Networks, 2011, AISTATS.

[21] Pascal Vincent et al. The Manifold Tangent Classifier, 2011, NIPS.

[22] Yoshua Bengio et al. Suitability of V1 Energy Models for Object Classification, 2011, Neural Computation.

[23] Nitish Srivastava et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, arXiv.

[24] Yoshua Bengio et al. Random Search for Hyper-Parameter Optimization, 2012, J. Mach. Learn. Res.

[25] Yoshua Bengio et al. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription, 2012, ICML.

[26] Jürgen Schmidhuber et al. Multi-column deep neural network for traffic sign classification, 2012, Neural Networks.

[27] Razvan Pascanu et al. Theano: new features and speed improvements, 2012, arXiv.

[28] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.

[29] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[30] Tara N. Sainath et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.

[31] Pascal Vincent et al. Disentangling Factors of Variation for Facial Expression Recognition, 2012, ECCV.

[32] Yoshua Bengio et al. Deep Learning of Representations, 2013, Handbook on Neural Information Processing.

[33] Geoffrey E. Hinton et al. Modeling Natural Images Using Gated MRFs, 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34] Razvan Pascanu et al. Learned-norm pooling for deep neural networks, 2013, arXiv.

[35] Ian J. Goodfellow et al. Pylearn2: a machine learning research library, 2013, arXiv.

[36] Razvan Pascanu et al. On the difficulty of training recurrent neural networks, 2012, ICML.

[37] Yoshua Bengio et al. High-dimensional sequence transduction, 2012, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[38] Yoshua Bengio et al. Maxout Networks, 2013, ICML.

[39] Razvan Pascanu et al. How to Construct Deep Recurrent Neural Networks, 2013, ICLR.

[40] Christian Osendorfer et al. On Fast Dropout and its Applicability to Recurrent Networks, 2013, ICLR.

[41] Razvan Pascanu et al. Revisiting Natural Gradient for Deep Networks, 2013, ICLR.

[42] Yoshua Bengio et al. Knowledge Matters: Importance of Prior Information for Optimization, 2013, J. Mach. Learn. Res.