Knowledge Matters: Importance of Prior Information for Optimization

We explore the effect of introducing prior information into the intermediate level of neural networks on a learning task at which all the state-of-the-art machine learning algorithms we tested fail. Our work is motivated by the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance, using a curriculum. The experiments we conducted provide positive evidence for this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset of 64x64 binary input images, each containing three sprites. The final task is to decide whether all the sprites are the same or whether one of them differs from the others. Sprites are pentomino (tetris-like) shapes and are placed at different locations in the image using scaling and rotation transformations. The first tier of the MLP is pre-trained with intermediate-level targets, namely the presence of each sprite type at each location, while the second tier takes the output of the first tier as input and predicts the binary target of the final task. With a few tens of thousands of examples, the two-tiered MLP architecture learned the task perfectly, whereas all other algorithms tested (including deep networks with unsupervised pre-training, as well as traditional algorithms such as SVMs, decision trees, and boosting) performed no better than chance. We hypothesize that the optimization difficulty arising when the intermediate pre-training is not performed is due to the composition of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by observations of optimization difficulties with deep learning, presumably due to effective local minima.
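
To make the two-tiered setup concrete, the following is a minimal sketch of the architecture described above, not the authors' code. It assumes PyTorch, an 8x8 grid of 8x8-pixel patches covering the 64x64 image, and an illustrative number of sprite classes; all layer sizes, class counts, and module names (FirstTier, SecondTier) are hypothetical choices for exposition.

```python
# Hedged sketch of the two-tiered MLP with intermediate supervision.
# Assumptions: PyTorch; 8x8 grid of 8x8 patches; illustrative layer sizes.
import torch
import torch.nn as nn

N_PATCHES = 64          # hypothetical: an 8x8 grid of patches over the 64x64 image
PATCH_DIM = 8 * 8       # pixels per patch fed to the first tier
N_CLASSES = 11          # hypothetical: 10 sprite classes + 1 "empty" class

class FirstTier(nn.Module):
    """Applied to every patch; pre-trained with intermediate targets
    (which sprite, if any, is present at that location)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PATCH_DIM, 128), nn.ReLU(),
            nn.Linear(128, N_CLASSES),
        )

    def forward(self, patches):               # patches: (B, N_PATCHES, PATCH_DIM)
        return self.net(patches)               # (B, N_PATCHES, N_CLASSES)

class SecondTier(nn.Module):
    """Takes the first tier's per-location outputs and predicts the final
    binary target: are all sprites the same, or is one different?"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_PATCHES * N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, per_patch_logits):       # (B, N_PATCHES, N_CLASSES)
        flat = per_patch_logits.flatten(1)     # concatenate all locations
        return self.net(flat)                  # (B, 1) logit for same/different

# Stage 1: pre-train FirstTier on the intermediate per-location sprite labels
# (cross-entropy over N_CLASSES). Stage 2: train SecondTier (optionally also
# fine-tuning FirstTier) on the final same/different label.
first, second = FirstTier(), SecondTier()
intermediate_loss = nn.CrossEntropyLoss()
final_loss = nn.BCEWithLogitsLoss()
```

The point of the sketch is the two-stage supervision: without the stage-1 intermediate targets, the same architecture would have to discover the per-location sprite concepts and the same/different relation jointly, which is the composition of non-linear tasks the paper identifies as the source of the optimization difficulty.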
