On Anytime Learning at Macroscale

Classical machine learning frameworks assume access to a possibly large dataset in order to train a predictive model. In many practical applications, however, data does not arrive all at once but in batches over time. This creates a natural trade-off between the accuracy of a model and the time needed to obtain it. A greedy predictor could produce non-trivial predictions by training on each batch as soon as it becomes available, but it may also make sub-optimal use of future data. A tardy predictor, on the other hand, could wait a long time to aggregate several batches into a larger dataset, but ultimately deliver much better performance. In this work, we consider such a streaming learning setting, which we dub anytime learning at macroscale (ALMA). It is an instance of anytime learning applied not at the level of a single chunk of data, but at the level of the entire sequence of large batches. We first formalize this learning setting, then introduce metrics to assess how well learners perform on a given task for a given memory and compute budget, and finally test several baseline approaches on standard benchmarks repurposed for anytime learning at macroscale. The general finding is that bigger models always generalize better. In particular, it is important to grow model capacity over time if the initial model is relatively small. Moreover, updating the model at an intermediate rate strikes the best trade-off between accuracy and time to obtain a useful predictor.
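
To make the greedy-versus-tardy trade-off concrete, the sketch below simulates a stream of mega-batches and compares a learner that updates as soon as each batch arrives (`wait=1`) against one that aggregates several batches before updating. This is a minimal illustration only: the synthetic data, the `SGDClassifier` model, and the `wait` parameter are assumptions for exposition, not the benchmarks, architectures, or exact protocol studied in the paper.

```python
# Minimal sketch of the ALMA trade-off (illustrative assumptions throughout).
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Synthetic task; the first 2,000 points are held out for anytime evaluation.
X, y = make_classification(n_samples=12_000, n_features=20, random_state=0)
X_test, y_test = X[:2_000], y[:2_000]
mega_batches = np.array_split(np.arange(2_000, 12_000), 10)  # stream of 10 large batches

def run(wait):
    """wait=1 is a greedy learner; wait=k aggregates k batches before each update."""
    model = SGDClassifier(random_state=0)
    buffer, curve = [], []
    for t, idx in enumerate(mega_batches):
        buffer.extend(idx.tolist())
        if len(buffer) >= wait * len(mega_batches[0]):
            # update the predictor on everything buffered so far
            model.partial_fit(X[buffer], y[buffer], classes=np.unique(y))
            buffer = []
        # anytime evaluation: the model can be queried after every incoming batch
        acc = accuracy_score(y_test, model.predict(X_test)) if hasattr(model, "coef_") else 0.0
        curve.append((t, acc))
    return curve

greedy = run(wait=1)  # non-trivial predictions early, possibly sub-optimal use of later data
tardy = run(wait=5)   # useless early on, but each update sees a larger aggregated dataset
```

Plotting the two accuracy curves over time shows the trade-off the paper formalizes: the greedy learner is useful sooner, while the tardier learner catches up or overtakes it once enough data has been aggregated.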
