Universum Prescription: Regularization Using Unlabeled Data

This paper shows that simply prescribing "none of the above" labels to unlabeled data has a beneficial regularization effect on supervised learning. We call this universum prescription because the prescribed labels cannot be any of the supervised labels. Despite its simplicity, universum prescription obtains competitive results when training deep convolutional networks on the CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. A qualitative justification of these approaches using Rademacher complexity is presented. The effect of a regularization parameter, the probability of sampling from unlabeled data, is also studied empirically.
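
As an illustration of the idea described above, the sketch below shows one plausible instantiation: the classifier gets one extra "none of the above" output, and each training step draws an unlabeled batch with a fixed probability and prescribes it that extra label. The network, the hyperparameters, the sampling probability P_UNLABELED, and the extra-class formulation are assumptions made for the sketch, not the authors' exact setup.

    # Minimal sketch of universum prescription, assuming an extra "none of the
    # above" class appended to the supervised label set (hypothetical setup,
    # not the paper's exact implementation).
    import random
    import torch
    import torch.nn as nn

    NUM_SUPERVISED_CLASSES = 10                   # e.g. CIFAR-10
    NONE_OF_THE_ABOVE = NUM_SUPERVISED_CLASSES    # index of the prescribed extra class
    P_UNLABELED = 0.2                             # regularization parameter: probability of sampling unlabeled data

    model = nn.Sequential(                        # toy stand-in for a convolutional network
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, 256),
        nn.ReLU(),
        nn.Linear(256, NUM_SUPERVISED_CLASSES + 1),  # one extra output for "none of the above"
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def training_step(labeled_batch, unlabeled_batch):
        """One SGD step; with probability P_UNLABELED the unlabeled batch
        is used and prescribed the "none of the above" label."""
        if random.random() < P_UNLABELED:
            x = unlabeled_batch
            y = torch.full((x.size(0),), NONE_OF_THE_ABOVE, dtype=torch.long)
        else:
            x, y = labeled_batch
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()

    # Usage with random tensors standing in for labeled and unlabeled images:
    labeled = (torch.randn(32, 3, 32, 32),
               torch.randint(0, NUM_SUPERVISED_CLASSES, (32,)))
    unlabeled = torch.randn(32, 3, 32, 32)
    print(training_step(labeled, unlabeled))

In this reading, P_UNLABELED plays the role of the regularization parameter studied empirically in the paper: raising it makes the unlabeled "none of the above" data dominate training, lowering it recovers plain supervised learning.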
