Gradient Coordination for Quantifying and Maximizing Knowledge Transference in Multi-Task Learning

Multi-task learning (MTL) has been widely applied in online advertising systems. To address the negative transfer issue, recent optimization methods emphasize aligning gradient directions or magnitudes across tasks. However, since prior studies have shown that shared modules contain both general and task-specific knowledge, overemphasizing gradient alignment may crowd out task-specific knowledge. In this paper, we propose CoGrad, a transference-driven approach that adaptively maximizes knowledge transference via Coordinated Gradient modification. We explicitly quantify transference as the loss reduction induced on one task by an update to another, and optimize it to derive an auxiliary gradient. By incorporating this gradient into the original task gradients, the model automatically maximizes inter-task transfer while minimizing individual task losses, harmonizing general and task-specific knowledge. In addition, we introduce an efficient approximation of the Hessian matrix, making CoGrad computationally efficient. Both offline and online experiments verify that CoGrad significantly outperforms previous methods.
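
As a concrete illustration of the transference notion described above, the sketch below estimates the transfer from task a to task b as the reduction in task b's loss after a small lookahead step along task a's gradient on the shared parameters. This is a minimal reconstruction under our own assumptions (a toy shared-bottom model, the step size `eta`, and the helper names `task_losses` and `transference` are illustrative), not the authors' CoGrad implementation; in particular, differentiating this quantity to obtain the auxiliary gradient involves Hessian terms, which the paper approximates efficiently and this sketch does not attempt.

```python
# Hedged sketch: transference from task a to task b measured as the drop in
# task b's loss after a virtual "lookahead" step along task a's gradient.
# Model, data, and step size are toy assumptions for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy shared-bottom MTL model with two task-specific heads.
shared = nn.Linear(8, 16)
head_a = nn.Linear(16, 1)
head_b = nn.Linear(16, 1)
params = list(shared.parameters())  # transference is measured on shared params

x = torch.randn(32, 8)
y_a = torch.randn(32, 1)
y_b = torch.randn(32, 1)
mse = nn.MSELoss()


def task_losses():
    """Forward pass returning (loss of task a, loss of task b)."""
    h = torch.relu(shared(x))
    return mse(head_a(h), y_a), mse(head_b(h), y_b)


def transference(eta=1e-2):
    """Return T(a -> b): how much a step on task a's gradient lowers task b's loss."""
    loss_a, loss_b = task_losses()
    grads_a = torch.autograd.grad(loss_a, params)
    loss_b_before = loss_b.item()

    with torch.no_grad():
        # Virtual lookahead step on the shared parameters.
        for p, g in zip(params, grads_a):
            p -= eta * g
        _, loss_b_after = task_losses()
        # Undo the virtual step so the model is left unchanged.
        for p, g in zip(params, grads_a):
            p += eta * g

    return loss_b_before - loss_b_after.item()


print(f"T(a -> b) = {transference():+.6f}")  # positive => task a's update helps task b
```

A positive value indicates helpful transfer from task a to task b; in the method described by the abstract, an auxiliary gradient is derived by optimizing such a transference measure and is then combined with the original task gradients.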
