Understanding and Improving Hidden Representations for Neural Machine Translation

Multilayer architectures are currently the gold standard for large-scale neural machine translation. Existing works have explored methods for understanding the hidden representations, but they have not sought to improve translation quality based on that understanding. Toward understanding that serves performance improvement, we first artificially construct a sequence of nested relative tasks and measure the feature generalization ability of the learned hidden representations across these tasks. Guided by this analysis, we then propose to regularize the layer-wise representations with all tree-induced tasks. To overcome the computational bottleneck caused by the large number of regularization terms, we design efficient approximation methods that select only a few coarse-to-fine tasks for regularization. Extensive experiments on two widely used datasets demonstrate that the proposed methods incur only a small extra overhead in training and no additional overhead in testing, while achieving consistent improvements (up to +1.3 BLEU) over a state-of-the-art translation model.
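To make the layer-wise regularization idea concrete, below is a minimal sketch (not the authors' implementation) of how a few coarse-to-fine auxiliary tasks could be attached to intermediate hidden states and added to the translation loss. All names here are hypothetical: `layer_states` is assumed to be a list of per-layer hidden states, `task_labels` a list of label tensors of increasing granularity (e.g., from a hierarchical clustering of the vocabulary), and `num_classes_per_task` the corresponding label-set sizes.

```python
# Sketch only: layer-wise regularization with a few coarse-to-fine auxiliary tasks.
# Assumes per-layer hidden states of shape [batch, seq, d_model] and one label
# tensor of shape [batch, seq] per selected task (coarse -> fine).
import torch
import torch.nn as nn


class LayerwiseTaskRegularizer(nn.Module):
    def __init__(self, d_model, num_classes_per_task):
        super().__init__()
        # One linear classifier per selected coarse-to-fine task / layer.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, c) for c in num_classes_per_task]
        )

    def forward(self, layer_states, task_labels, pad_mask):
        """layer_states: list of [batch, seq, d_model], one per regularized layer.
        task_labels: list of [batch, seq] label tensors, coarse -> fine.
        pad_mask: [batch, seq] bool tensor, True at padding positions."""
        reg_loss = 0.0
        for states, head, labels in zip(layer_states, self.heads, task_labels):
            logits = head(states)  # [batch, seq, num_classes]
            loss = nn.functional.cross_entropy(
                logits.transpose(1, 2), labels, reduction="none"
            )  # [batch, seq]
            loss = loss.masked_fill(pad_mask, 0.0)
            reg_loss = reg_loss + loss.sum() / (~pad_mask).sum()
        return reg_loss


# Usage sketch: combine with the main translation loss, with weight alpha.
# total_loss = translation_loss + alpha * regularizer(enc_states, labels, pad_mask)
```

Since the auxiliary heads are only used to shape the hidden representations during training, they can be discarded at inference time, which is consistent with the claim of no additional overhead in testing.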
