Normalization of Input-output Shared Embeddings in Text Generation Models

Neural-network-based models have become the state of the art for many Natural Language Processing tasks, yet the problem of large input and output dimensions has not been fully resolved, especially in text generation tasks (e.g., Machine Translation and Text Summarization), where both the input and output vocabularies are very large. Input-output embedding weight sharing has therefore been introduced and widely adopted, but it still leaves room for improvement. Drawing on linear-algebraic and statistical arguments, this paper identifies a shortcoming of the existing input-output embedding weight sharing method and proposes several improvements, among which normalization of the shared embedding weight matrix performs best. These methods add almost no computational cost, can be combined with other embedding techniques, and are effective when applied to state-of-the-art models. For Transformer-big models, the normalization techniques yield up to 0.6 BLEU improvement over the original model on the WMT'16 En-De dataset, with similar gains on the IWSLT'14 datasets. For DynamicConv models, improvements of 0.5 BLEU on WMT'16 En-De and 0.41 BLEU on IWSLT'14 De-En are achieved.
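
To make the idea of a normalized, input-output shared embedding concrete, the following is a minimal PyTorch sketch. It assumes one plausible instantiation of "normalization of the embedding weight matrix", namely row-wise L2 normalization applied before the matrix is used both for the input lookup and as the output (softmax) projection; the class name, initialization, and normalization choice are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormalizedTiedEmbedding(nn.Module):
    """Input-output shared embedding with row-wise L2 normalization.

    Hypothetical sketch: the same weight matrix serves as the input
    embedding table and the output projection, and each token's row is
    rescaled to unit norm before use.
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, d_model))
        nn.init.normal_(self.weight, mean=0.0, std=d_model ** -0.5)

    def _normalized_weight(self) -> torch.Tensor:
        # Normalize every row (token vector) of the shared matrix to unit L2 norm.
        return F.normalize(self.weight, p=2, dim=-1)

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Input side: look up normalized embeddings for the token ids.
        return F.embedding(token_ids, self._normalized_weight())

    def project(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Output side: reuse the same normalized matrix as the softmax projection.
        return hidden_states @ self._normalized_weight().t()


# Usage sketch (vocab_size and d_model are illustrative values).
vocab_size, d_model = 32000, 512
tied = NormalizedTiedEmbedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, 7))        # (batch, seq_len)
x = tied.embed(tokens)                               # (2, 7, 512)
logits = tied.project(torch.randn(2, 7, d_model))    # (2, 7, 32000)
```

Because the normalization is a single rescaling of an already-shared matrix, it adds essentially no parameters and negligible compute, which is consistent with the abstract's claim that the methods are nearly cost-free and composable with other embedding techniques.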
