Using Perturbed Length-aware Positional Encoding for Non-autoregressive Neural Machine Translation

Non-autoregressive neural machine translation (NAT) usually employs sequence-level knowledge distillation with an autoregressive neural machine translation (AT) model as its teacher. However, a NAT model often outputs shorter sentences than an AT model. In this work, we propose sequence-level knowledge distillation (SKD) using perturbed length-aware positional encoding and apply it to a student model, the Levenshtein Transformer. Our method outperformed a standard Levenshtein Transformer by up to 2.5 bilingual evaluation understudy (BLEU) points on WMT14 German-to-English translation. The resulting NAT model also produced longer sentences than the baseline NAT models.
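To make the idea concrete, below is a minimal sketch of a length-aware positional encoding with length perturbation. It assumes the length-difference formulation of Takase and Okazaki [14], which encodes the number of remaining tokens rather than the absolute position, and a uniform integer noise scheme for the perturbation during training; the function names, the `max_noise` parameter, and the exact noise distribution are illustrative assumptions, not necessarily the paper's configuration.

```python
import numpy as np

def length_aware_positional_encoding(length: int, d_model: int) -> np.ndarray:
    """Length-difference positional encoding (LDPE) sketch.

    Encodes the number of remaining tokens (length - pos) instead of
    the absolute position, so the decoder can anticipate where the
    sentence should end. Assumes d_model is even.
    """
    pe = np.zeros((length, d_model))
    positions = np.arange(length)[:, None]        # 0 .. length-1
    remaining = length - positions                # length .. 1
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(remaining * div_term)
    pe[:, 1::2] = np.cos(remaining * div_term)
    return pe

def perturbed_length(true_length: int, max_noise: int = 3,
                     rng=np.random) -> int:
    """Jitter the target length by a random integer offset at training
    time (hypothetical noise scheme) so the decoder tolerates
    length-prediction errors at inference time."""
    noise = rng.randint(-max_noise, max_noise + 1)
    return max(1, true_length + noise)

# Example: encode a 10-token target with a perturbed length constraint.
pe = length_aware_positional_encoding(perturbed_length(10), d_model=512)
```

Training with encodings computed from a perturbed length exposes the decoder to inaccurate length constraints, which is why the distilled student learns not to truncate its output when the length estimate at inference time is too short.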

[1] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network, 2015, arXiv.

[2] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[3] Toshiaki Nakazawa et al. ASPEC: Asian Scientific Paper Excerpt Corpus, 2016, LREC.

[4] Graham Neubig et al. Understanding Knowledge Distillation in Non-autoregressive Machine Translation, 2020, ICLR.

[5] Myle Ott et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.

[6] Changhan Wang et al. Levenshtein Transformer, 2019, NeurIPS.

[7] Jiajun Zhang et al. Addressing the Under-Translation Problem from the Entropy Perspective, 2019, AAAI.

[8] Taku Kudo et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[9] Matt Post et al. A Call for Clarity in Reporting BLEU Scores, 2018, WMT.

[10] Katsuhito Sudoh et al. Incorporating Noisy Length Constraints into Transformer with Length-aware Positional Encodings, 2020, COLING.

[11] Taku Kudo et al. MeCab: Yet Another Part-of-Speech and Morphological Analyzer, 2005.

[12] Satoshi Nakamura et al. Length-constrained Neural Machine Translation using Length Prediction and Perturbation into Length-aware Positional Encoding, 2021, Journal of Natural Language Processing.

[13] Philipp Koehn et al. Findings of the 2014 Workshop on Statistical Machine Translation, 2014, WMT@ACL.

[14] Naoaki Okazaki et al. Positional Encoding to Control Output Sequence Length, 2019, NAACL.

[15] Marcello Federico et al. Controlling the Output Length of Neural Machine Translation, 2019, IWSLT.

[16] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[17] Omer Levy et al. Mask-Predict: Parallel Decoding of Conditional Masked Language Models, 2019, EMNLP.