Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs

A common approach to jointly learning multiple tasks with a shared structure is to optimize the model on a combined landscape of multiple sub-costs. However, gradients derived from the individual sub-costs often conflict on cost plateaus, leading to a subpar optimum. In this work, we shed light on this gradient-conflict challenge and propose a solution named Cost-Out, which randomly drops sub-costs at each iteration. We provide theoretical and empirical evidence for the existence of an escaping pressure induced by the Cost-Out mechanism. Although simple, the proposed method empirically improves performance on multi-task learning problems, including two-digit image classification sampled from the MNIST dataset and machine translation between English and French, Spanish, and German on the WMT14 datasets.
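
Below is a minimal sketch of the drop mechanism described above, assuming a PyTorch-style training loop. The function name `cost_out`, the `keep_prob` parameter and its default value, and the guard against dropping every sub-cost are illustrative assumptions, not details taken from the paper.

```python
import torch

def cost_out(sub_costs, keep_prob=0.5):
    """Combine per-task sub-costs, randomly dropping some of them this iteration.

    sub_costs : list of scalar loss tensors, one per task.
    keep_prob : probability that each sub-cost is kept (assumed value).
    """
    # Sample an independent keep/drop decision for every sub-cost.
    mask = torch.bernoulli(torch.full((len(sub_costs),), keep_prob))
    # Guard against dropping everything: keep one sub-cost chosen at random.
    if mask.sum() == 0:
        mask[torch.randint(len(sub_costs), (1,))] = 1.0
    # Only the kept sub-costs contribute gradients in this step.
    return sum(m * c for m, c in zip(mask, sub_costs))

# Hypothetical usage inside a training step:
# loss = cost_out([task_a_loss, task_b_loss], keep_prob=0.5)
# loss.backward()
# optimizer.step()
```

Because a different subset of sub-costs is active at each iteration, the combined gradient is not forced to remain a fixed convex combination of conflicting task gradients, which is the intuition behind the escaping pressure the paper analyzes.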
