Large Margin Deep Neural Networks: Theory and Algorithms

Deep neural networks (DNN) have achieved huge practical success in recent years. However, their theoretical properties (in particular, generalization ability) are not yet well understood, since existing error bounds for neural networks cannot be directly used to explain the statistical behaviors of practically adopted DNN models (which are multi-class in nature and may contain convolutional layers). To tackle this challenge, we derive a new margin bound for DNN in this paper, in which the expected 0-1 error of a DNN model is upper bounded by its empirical margin error plus a Rademacher-average-based capacity term. This new bound is very general and is consistent with the empirical behaviors of DNN models observed in our experiments. According to the new bound, minimizing the empirical margin error can effectively improve the test performance of DNN. We therefore propose large margin DNN algorithms, which impose margin penalty terms on the cross-entropy loss of DNN so as to reduce the margin error during the training process. Experimental results show that the proposed algorithms achieve significantly smaller empirical margin errors, as well as better test performance, than the standard DNN algorithm.
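The abstract states the margin bound only informally. As a sketch of the general shape such bounds take (following classical margin-bound results in the style of Koltchinskii and Panchenko), with probability at least 1 - delta over an i.i.d. sample of size n and for any margin gamma > 0:

% Schematic only: the paper's exact capacity term and constants are not
% given in the abstract; this follows the standard margin-bound template.
\Pr\big[f \text{ makes a 0-1 error}\big]
  \;\le\; \underbrace{\widehat{R}_{\gamma}(f)}_{\text{empirical margin error}}
  \;+\; \underbrace{\frac{c\,\mathfrak{R}_n(\mathcal{F})}{\gamma}}_{\text{Rademacher-average capacity term}}
  \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}}

The abstract likewise does not spell out the margin penalty term added to the cross-entropy loss. The following is a minimal PyTorch sketch of one plausible instantiation, in which a hinge-style penalty pushes the true-class logit to exceed the largest competing logit by at least a fixed margin; the function name and the margin and penalty_weight hyperparameters are illustrative assumptions, not the paper's own formulation.

import torch
import torch.nn.functional as F

def large_margin_cross_entropy(logits, targets, margin=1.0, penalty_weight=0.1):
    """Cross-entropy plus a hinge-style margin penalty (illustrative sketch).

    The penalty is positive whenever the true-class logit fails to exceed
    the largest other-class logit by at least `margin`, mirroring the
    empirical margin error the abstract says the algorithms minimize.
    NOTE: `margin` and `penalty_weight` are hypothetical hyperparameters;
    the paper's exact penalty term is not given in the abstract.
    """
    ce = F.cross_entropy(logits, targets)

    # True-class logit for each example: shape (N,).
    true_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Largest competing logit: mask out the true class, then take the max.
    masked = logits.clone()
    masked.scatter_(1, targets.unsqueeze(1), float('-inf'))
    top_other = masked.max(dim=1).values

    # Hinge penalty on the multi-class margin (true_logit - top_other).
    margin_penalty = F.relu(margin - (true_logit - top_other)).mean()

    return ce + penalty_weight * margin_penalty

In training, such a criterion would simply replace plain cross-entropy; the penalty vanishes exactly when every example is classified with multi-class margin at least `margin`, which is how a term of this kind drives the empirical margin error down.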
