Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks

Our first contribution in this paper is a theoretical investigation of curriculum learning in the context of stochastic gradient descent when optimizing the least-squares loss function. We prove that the rate of convergence of an ideal curriculum learning method is monotonically increasing with the difficulty of the examples, and that this increase in convergence rate diminishes monotonically as training proceeds. Our second contribution is an analysis of curriculum learning in the context of training a CNN for image classification. Here a crucial question is how to construct a curriculum in practice. We describe a method that infers the curriculum by way of transfer learning from another network, pre-trained on a different task. While this approach can only approximate the ideal curriculum, we empirically observe behavior similar to that predicted by the theory, namely a significant boost in convergence speed at the beginning of training. When the task is made more difficult, we also observe improved generalization performance. Finally, curriculum learning exhibits robustness against unfavorable conditions such as strong regularization.
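
To make the curriculum-by-transfer idea concrete, the sketch below shows one plausible instantiation under assumptions not spelled out in the abstract: example difficulty is scored by the confidence of a simple classifier (here a hypothetical scikit-learn LogisticRegression) trained on features transferred from a network pre-trained on a different task, and mini-batches are then drawn in easy-to-hard order. This is a minimal illustration, not the authors' exact procedure.

```python
# Minimal sketch (assumed, not the paper's exact method): rank examples by the
# confidence a simple classifier, fit on transferred features, assigns to the
# true class, then feed SGD mini-batches from easiest to hardest.
import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_difficulty(features, labels):
    """features: activations from a network pre-trained on a different task.
    labels: ground-truth classes of the target task.
    Returns one difficulty score per example (higher = harder)."""
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    proba = clf.predict_proba(features)                    # shape (N, n_classes)
    label_idx = np.searchsorted(clf.classes_, labels)      # map labels to columns
    confidence = proba[np.arange(len(labels)), label_idx]  # p(true class | x)
    return 1.0 - confidence

def easy_to_hard_batches(difficulty, batch_size):
    """Yield mini-batch index arrays ordered from easiest to hardest."""
    order = np.argsort(difficulty)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

# Usage sketch: `feats` could be, e.g., penultimate-layer activations of a CNN
# pre-trained on ImageNet, computed for the target-task training images.
# rng = np.random.default_rng(0)
# feats, ys = rng.normal(size=(100, 64)), rng.integers(0, 5, size=100)
# for batch in easy_to_hard_batches(transfer_difficulty(feats, ys), 16):
#     ...  # one SGD step on the examples indexed by `batch`
```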
