AutoSeM: Automatic Task Selection and Mixing in Multi-Task Learning

Multi-task learning (MTL) has achieved success on a wide range of problems, where the goal is to improve the performance of a primary task using a set of relevant auxiliary tasks. However, when the usefulness of the auxiliary tasks with respect to the primary task is not known a priori, the success of MTL models depends on choosing the right auxiliary tasks and on a balanced mixing ratio of these tasks during alternate training. These two problems could be resolved via manual intuition or hyper-parameter tuning over all combinatorial task choices, but this introduces inductive bias or does not scale when the number of candidate auxiliary tasks is very large. To address these issues, we present AutoSeM, a two-stage MTL pipeline, where the first stage automatically selects the most useful auxiliary tasks via a Beta-Bernoulli multi-armed bandit with Thompson Sampling, and the second stage learns the training mixing ratio of these selected auxiliary tasks via a Gaussian-Process-based Bayesian optimization framework. We conduct several MTL experiments on the GLUE language understanding tasks and show that our AutoSeM framework can successfully find relevant auxiliary tasks and automatically learn their mixing ratio, achieving significant performance boosts on several primary tasks. Finally, we present ablations for each stage of AutoSeM and analyze the learned auxiliary task choices.
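To make Stage 1 concrete, here is a minimal Python sketch of a Beta-Bernoulli multi-armed bandit with Thompson Sampling, where each arm is a candidate auxiliary task. The helper `train_step(task)` is a hypothetical stand-in for one round of alternate MTL training that returns a Bernoulli reward (e.g., 1 if the primary task's validation metric improved); the exact reward definition and stopping criterion follow the paper, not this sketch.

```python
import numpy as np

def thompson_task_selection(num_tasks, num_rounds, train_step, seed=0):
    """Beta-Bernoulli Thompson Sampling over candidate auxiliary tasks."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(num_tasks)  # Beta posterior parameters (successes + 1)
    beta = np.ones(num_tasks)   # Beta posterior parameters (failures + 1)
    for _ in range(num_rounds):
        theta = rng.beta(alpha, beta)   # sample a plausible utility per task
        task = int(np.argmax(theta))    # train on the most promising arm
        reward = train_step(task)       # Bernoulli reward: did the primary task improve?
        alpha[task] += reward           # conjugate posterior update
        beta[task] += 1 - reward
    # Posterior means rank the candidate tasks; keep the top-ranked ones.
    return alpha / (alpha + beta)
```

Stage 2 then tunes the mixing ratio of the selected tasks with Gaussian-Process-based Bayesian optimization. The sketch below uses scikit-optimize's `gp_minimize` as one off-the-shelf GP-BO implementation (an assumption; the paper does not prescribe this library), together with a hypothetical `evaluate(ratios)` that trains the MTL model with the given per-task mixing counts and returns the primary task's validation score.

```python
from skopt import gp_minimize
from skopt.space import Integer

NUM_SELECTED = 3  # auxiliary tasks kept after Stage 1 (example value)

def objective(ratios):
    # gp_minimize minimizes, so negate the score we want to maximize.
    return -evaluate(ratios)  # evaluate() is a hypothetical training routine

# One integer mixing count per task: the primary task plus each selected auxiliary.
space = [Integer(1, 20, name=f"ratio_{i}") for i in range(NUM_SELECTED + 1)]
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best mixing ratio:", result.x, "primary-task score:", -result.fun)
```

The GP surrogate makes each candidate mixing ratio cheap to propose relative to the cost of evaluating it (a full MTL training run), which is why Bayesian optimization is a natural fit for this stage.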
