Adaptive Sparse Transformer for Multilingual Translation

Multilingual machine translation has attracted much attention recently because it enables knowledge transfer among languages and is cheaper to train and deploy than maintaining numerous bilingual models. A known challenge of multilingual models is negative language interference. To improve translation quality, deeper and wider architectures are applied to multilingual modeling for larger capacity, but this also increases inference cost. Recent studies have pointed out that parameters shared among languages are the cause of interference, yet they are also what enables positive transfer. Based on these insights, we propose an adaptive and sparse architecture for multilingual modeling and train the model to learn shared and language-specific parameters, improving positive transfer while mitigating interference. The sparse architecture activates only a sub-network, which preserves inference efficiency, and the adaptive design selects different sub-networks based on the input languages. Our model outperforms strong baselines across multiple benchmarks. On the large-scale OPUS dataset covering 100 languages, it achieves +2.1, +1.3, and +6.2 BLEU improvements on one-to-many, many-to-one, and zero-shot tasks, respectively, compared to the standard Transformer, without increasing inference cost.
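
The abstract does not spell out the selection mechanism, so the following is a minimal, hypothetical PyTorch sketch of one way a language-adaptive sparse layer could be realized: each language owns a learned hard gate that chooses between a shared feed-forward sub-layer and a language-specific one, trained with a straight-through Gumbel-Softmax (a common choice for learned discrete selection) and executed sparsely at inference. All names here (LanguageAdaptiveFFN, gate_logits, lang_id, tau) are illustrative assumptions, not the paper's actual design or API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageAdaptiveFFN(nn.Module):
    """Hypothetical sketch: a feed-forward block in which each language learns a
    hard gate that picks between a shared sub-layer and its own
    language-specific sub-layer, so only one path runs at inference."""

    def __init__(self, d_model: int, d_ff: int, num_languages: int, tau: float = 1.0):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.specific = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
                )
                for _ in range(num_languages)
            ]
        )
        # Two logits per language: [use shared path, use language-specific path].
        self.gate_logits = nn.Parameter(torch.zeros(num_languages, 2))
        self.tau = tau

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        logits = self.gate_logits[lang_id]
        if self.training:
            # Straight-through Gumbel-Softmax: a differentiable hard selection,
            # so the gate is learned jointly with the translation loss.
            gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)
            return gate[0] * self.shared(x) + gate[1] * self.specific[lang_id](x)
        # At inference, only the selected sub-network is executed, keeping the
        # per-token cost equal to that of a single dense feed-forward block.
        if int(logits.argmax()) == 0:
            return self.shared(x)
        return self.specific[lang_id](x)


# Illustrative usage: route hidden states for one (made-up) language index.
layer = LanguageAdaptiveFFN(d_model=512, d_ff=2048, num_languages=100)
hidden = torch.randn(8, 20, 512)   # (batch, sequence length, d_model)
output = layer(hidden, lang_id=3)  # lang_id=3 is an arbitrary example
```

In this sketch, both paths are evaluated during training so gradients reach the unused branch, while the hard gate at inference makes computation sparse, which is consistent with the abstract's claim that inference cost does not increase.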
