Deep curriculum learning optimization

We describe a quantitative, practical framework for integrating curriculum learning (CL) into the deep learning training pipeline to improve feature learning in deep feed-forward networks. The framework has several distinguishing characteristics: (1) dynamicity: it proposes a set of batch-level training strategies (syllabi or curricula) that are sensitive to data complexity; (2) adaptivity: it dynamically estimates the effectiveness of a given strategy and compares it objectively against alternative strategies, making the method suitable for both practical and research purposes; and (3) a replace-retrain mechanism, invoked when a strategy proves unfit for the task at hand. In addition, the framework can combine CL with several variants of gradient descent (GD) and has been used to generate efficient batch-specific or dataset-specific strategies. Comparative studies of current state-of-the-art vision models, such as FixEfficientNet and BiT-L (ResNet), on several benchmark datasets including CIFAR-10 demonstrate the effectiveness of the proposed method. We present results showing training loss reduction by as much as a factor of 5. Additionally, we present a set of practical curriculum strategies that improve the generalization performance of selected networks on various datasets.
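To make the batch-level mechanics concrete, here is a minimal sketch, not the authors' implementation: it orders each batch by a Shannon-entropy complexity proxy (one candidate syllabus), tracks the training loss, and swaps in an alternative strategy when progress stalls (the replace-retrain step). The names `image_entropy`, `order_batch`, `train_step`, `eval_loss`, and the `patience` threshold are all illustrative assumptions.

```python
# Hedged sketch of batch-level curriculum ordering with a replace-retrain loop.
# Assumes batches are numpy arrays of images with pixel values in [0, 1].
import numpy as np

def image_entropy(x):
    """Shannon entropy of an image's intensity histogram (complexity proxy)."""
    hist, _ = np.histogram(x, bins=256, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return -np.sum(p * np.log2(p))

def order_batch(batch, ascending=True):
    """One candidate syllabus: sort samples by entropy (easy-to-hard if ascending)."""
    scores = np.array([image_entropy(x) for x in batch])
    idx = np.argsort(scores)
    return batch[idx] if ascending else batch[idx[::-1]]

def train_with_curriculum(batches, train_step, eval_loss, patience=3):
    """Adaptively keep or replace a syllabus based on observed training loss.

    train_step(batch) performs one GD update; eval_loss() returns current loss.
    """
    strategies = [
        lambda b: order_batch(b, ascending=True),   # easy -> hard
        lambda b: order_batch(b, ascending=False),  # hard -> easy
        lambda b: b,                                # baseline: no ordering
    ]
    current, stale, best = 0, 0, float("inf")
    for batch in batches:
        train_step(strategies[current](batch))
        loss = eval_loss()
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:  # strategy judged unfit: replace it and keep training
            current = (current + 1) % len(strategies)
            stale = 0
    return best
```

The entropy score here stands in for whichever complexity measure a deployment prefers; mutual information between samples or any other per-sample difficulty estimate would slot into `order_batch` the same way.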
