Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed using representations from different encoder layers to capture diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which can leave the encoder layer stack insufficiently trained due to the hierarchy bypassing problem. In this work, we propose layer-wise cross-view decoding, where each decoder layer attends not only to the representations from the last encoder layer, which serve as a global view, but also to those from other encoder layers, which supplement it with a stereoscopic view of the source sequences. Systematic experiments show that we successfully address the hierarchy bypassing problem and substantially improve the performance of sequence-to-sequence learning with deep representations on diverse tasks.
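The core idea can be illustrated with a minimal sketch: a decoder layer computes one attention context over the last encoder layer (the global view) and another over a lower encoder layer (the auxiliary view), then fuses the two. The function names, the choice of which auxiliary layer to pair with each decoder layer, and the simple averaging fusion below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # standard scaled dot-product attention
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

def cross_view_decoder_layer(dec_states, enc_last, enc_aux):
    """Fuse a global view (last encoder layer) with an auxiliary
    view (a lower encoder layer) for one decoder layer.

    dec_states: (tgt_len, d) decoder hidden states (queries)
    enc_last:   (src_len, d) last encoder layer representations
    enc_aux:    (src_len, d) auxiliary encoder layer representations
    """
    global_ctx = attend(dec_states, enc_last, enc_last)  # global view
    aux_ctx = attend(dec_states, enc_aux, enc_aux)       # stereoscopic supplement
    # hypothetical fusion: simple average of the two contexts
    return 0.5 * (global_ctx + aux_ctx)

# usage: pair decoder layer j with encoder layer j (one possible schedule)
rng = np.random.default_rng(0)
enc_layers = [rng.normal(size=(5, 8)) for _ in range(6)]  # 6 encoder layers
dec_states = rng.normal(size=(3, 8))
out = cross_view_decoder_layer(dec_states, enc_layers[-1], enc_layers[0])
```

In contrast, a standard Transformer decoder would compute only `attend(dec_states, enc_last, enc_last)`; the auxiliary term is what routes gradient signal directly into the lower encoder layers.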
