Multi-Task Learning as Multi-Objective Optimization

In multi-task learning, multiple tasks are solved jointly, sharing inductive bias between them. Multi-task learning is inherently a multi-objective problem because different tasks may conflict, necessitating a trade-off. A common compromise is to optimize a proxy objective that minimizes a weighted linear combination of per-task losses. However, this workaround is only valid when the tasks do not compete, which is rarely the case. In this paper, we explicitly cast multi-task learning as multi-objective optimization, with the overall objective of finding a Pareto optimal solution. To this end, we use algorithms developed in the gradient-based multi-objective optimization literature. These algorithms are not directly applicable to large-scale learning problems since they scale poorly with the dimensionality of the gradients and the number of tasks. We therefore propose an upper bound for the multi-objective loss and show that it can be optimized efficiently. We further prove that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions. We apply our method to a variety of multi-task deep learning problems including digit classification, scene understanding (joint semantic segmentation, instance segmentation, and depth estimation), and multi-label classification. Our method produces higher-performing models than recent multi-task learning formulations or per-task training.
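To make the contrast with a fixed weighted sum concrete, the sketch below illustrates the kind of computation used in gradient-based multi-objective optimization for two tasks. It is an illustration under stated assumptions, not the method proposed in this paper. Given the gradients of two task losses with respect to shared parameters, it finds the convex combination with the smallest norm. If that combination is nonzero, its negative is a direction that does not increase either loss to first order. If it is zero, no such common descent direction exists and the point is Pareto stationary. The function name `min_norm_two_tasks` and the plain-list gradient representation are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_two_tasks(
    g1: Sequence[float], g2: Sequence[float]
) -> Tuple[float, List[float]]:
    """Return (alpha, d) with d = alpha * g1 + (1 - alpha) * g2, where
    alpha in [0, 1] minimizes the Euclidean norm of d.

    Hypothetical helper for illustration only: g1 and g2 stand for the
    gradients of two task losses with respect to the shared parameters.
    """
    diff = [x - y for x, y in zip(g1, g2)]  # g1 - g2
    denom = dot(diff, diff)                 # ||g1 - g2||^2
    if denom == 0.0:
        # Identical gradients: every convex combination is the same vector.
        alpha = 0.5
    else:
        # Unconstrained minimizer of ||g2 + alpha * (g1 - g2)||^2,
        # then clipped to the interval [0, 1].
        alpha = dot([y - x for x, y in zip(g1, g2)], g2) / denom
        alpha = min(1.0, max(0.0, alpha))
    d = [alpha * x + (1.0 - alpha) * y for x, y in zip(g1, g2)]
    return alpha, d


# Two non-conflicting (orthogonal) gradients: a common descent direction exists.
alpha, d = min_norm_two_tasks([1.0, 0.0], [0.0, 1.0])
assert dot(d, [1.0, 0.0]) > 0 and dot(d, [0.0, 1.0]) > 0

# Two directly opposing gradients: the min-norm combination is the zero vector,
# so no step improves both tasks at once (Pareto stationarity).
alpha_c, d_c = min_norm_two_tasks([1.0, 0.0], [-1.0, 0.0])
assert all(abs(v) < 1e-12 for v in d_c)
```

Unlike a fixed weighted sum, the weight `alpha` here is recomputed from the current gradients at every step, which is what lets the update respect both tasks when they conflict.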