In Defense of the Unitary Scalarization for Deep Multi-Task Learning

Recent multi-task learning research argues against unitary scalarization, where training simply minimizes the sum of the task losses. Several ad hoc multi-task optimization algorithms have instead been proposed, inspired by various hypotheses about what makes multi-task settings difficult. The majority of these optimizers require per-task gradients and introduce significant memory, runtime, and implementation overhead. We show that unitary scalarization, coupled with standard regularization and stabilization techniques from single-task learning, matches or improves upon the performance of complex multi-task optimizers in popular supervised and reinforcement learning settings. We then present an analysis suggesting that many specialized multi-task optimizers can be partly interpreted as forms of regularization, potentially explaining our surprising results. We believe our results call for a critical reevaluation of recent research in the area.
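To make the contrast concrete, below is a minimal PyTorch-style sketch of one unitary-scalarization training step: all task losses are summed and a single backward pass is taken, with no per-task gradient bookkeeping. The names `model`, `optimizer`, `batch`, and `task_losses` are hypothetical placeholders for illustration, not names from the paper.

    import torch

    def unitary_scalarization_step(model, optimizer, batch, task_losses):
        # One step of unitary scalarization: minimize the plain sum of the
        # per-task losses. `task_losses` is assumed to be a list of callables,
        # each mapping (model outputs, batch) to a scalar loss tensor.
        optimizer.zero_grad()
        outputs = model(batch["inputs"])
        total_loss = sum(loss_fn(outputs, batch) for loss_fn in task_losses)
        total_loss.backward()  # a single backward pass covers every task
        optimizer.step()
        return total_loss.item()

Standard single-task regularization and stabilization plug in unchanged here, for example weight decay through the optimizer's weight_decay argument or dropout inside the model definition. Specialized multi-task optimizers, by contrast, typically need one backward pass per task to obtain per-task gradients, which is the source of the memory and runtime overhead the abstract describes.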
