On the use of prior and external knowledge in neural sequence models

Neural sequence models have recently achieved great success across a variety of natural language processing tasks. In practice, however, they require massive amounts of annotated training data to reach their full potential, and such data are not always available for the languages, domains, or tasks at hand. Prior and external knowledge provides additional contextual information that can improve modelling performance and compensate for the lack of large training corpora, particularly in low-resource settings. In this thesis, we investigate the usefulness of prior and external knowledge for improving neural sequence models. We propose the use of various kinds of prior and external knowledge and present approaches for integrating them into both the training and inference phases of neural sequence models. The main contributions of this thesis are summarised in two major parts.

The first part concerns Training and Modelling of neural sequence models. Here, we investigate different situations (particularly low-resource settings) in which prior and external knowledge, such as side information, linguistic factors, and monolingual data, is shown to substantially improve the performance of neural sequence models. In addition, we introduce a new means of incorporating prior and external knowledge based on the moment matching framework. This framework exploits prior and external knowledge as global features of generated sequences, in order to improve the overall quality of the desired output sequence.

The second part concerns Decoding of neural sequence models. We propose a novel decoding framework based on relaxed continuous optimisation, addressing one of the drawbacks of existing approximate decoding methods: their limited ability to incorporate global factors due to intractable search.
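To make the moment matching idea concrete, the following is a minimal toy sketch (not the thesis's actual implementation): a hypothetical global feature map over output sequences, and a penalty that matches the mean feature statistics of model samples against those of reference sequences. The feature map (sequence length plus unigram counts) is an illustrative assumption.

```python
from collections import Counter

def global_features(sentence):
    # Hypothetical global feature map: sequence length plus unigram counts.
    feats = Counter({"<len>": len(sentence)})
    feats.update(sentence)
    return feats

def moment_matching_penalty(sampled, references):
    # Match first-order moments: average the feature vectors over each set
    # of sequences, then penalise the squared L2 distance between the means.
    def mean_feats(seqs):
        total = Counter()
        for s in seqs:
            total.update(global_features(s))
        return {k: v / len(seqs) for k, v in total.items()}

    mu_sample, mu_ref = mean_feats(sampled), mean_feats(references)
    keys = set(mu_sample) | set(mu_ref)
    return sum((mu_sample.get(k, 0.0) - mu_ref.get(k, 0.0)) ** 2 for k in keys)
```

In training, a penalty of this form would be added to the usual likelihood objective, encouraging the model's generated sequences to agree with the references on the chosen global statistics.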
We hope that this PhD thesis, comprising the two major parts above, will shed light on the use of prior and external knowledge in neural sequence models, in both their training and decoding phases.
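The relaxed continuous decoding idea can likewise be sketched in miniature. Rather than searching over discrete token sequences, each output position is represented by an unconstrained logit vector, relaxed to a probability distribution via softmax, and a global scoring function is optimised by gradient ascent before discretising. The toy numerical gradient and scoring function below are assumptions for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def relaxed_decode(score_fn, vocab_size, length, steps=100, lr=0.5):
    # Continuous relaxation: one unconstrained logit vector per output
    # position, optimised by coordinate-wise gradient ascent on a global
    # score defined over the relaxed (softmax) distributions.
    logits = [[0.0] * vocab_size for _ in range(length)]
    eps = 1e-4
    for _ in range(steps):
        for t in range(length):
            for v in range(vocab_size):
                # Toy central-difference gradient of the global score.
                logits[t][v] += eps
                up = score_fn([softmax(row) for row in logits])
                logits[t][v] -= 2 * eps
                down = score_fn([softmax(row) for row in logits])
                logits[t][v] += eps
                logits[t][v] += lr * (up - down) / (2 * eps)
    # Discretise the relaxed solution by taking the argmax per position.
    return [max(range(vocab_size), key=lambda v: row[v]) for row in logits]
```

Because the score is evaluated on whole relaxed sequences, global factors (e.g. the moment matching features above, or any document-level constraint) can be folded directly into `score_fn`, which is precisely what intractable discrete search makes difficult.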
