Traditional models for question answering optimize using cross-entropy loss, which encourages exact answers at the cost of penalizing nearby or overlapping answers that are sometimes equally accurate. We propose a mixed objective that combines cross-entropy loss with self-critical policy learning. The objective uses rewards derived from word overlap to address the misalignment between the evaluation metric and the optimization objective. In addition to the mixed objective, we improve dynamic coattention networks (DCN) with a deep residual coattention encoder inspired by recent work in deep self-attention and residual networks. Our proposals improve model performance across question types and input lengths, especially for long questions that require the ability to capture long-term dependencies. On the Stanford Question Answering Dataset, our single model achieves state-of-the-art results with 75.1% exact match accuracy and 83.1% F1, while the ensemble obtains 78.9% exact match accuracy and 86.0% F1.
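To make the mixed objective concrete, the sketch below illustrates the word-overlap reward and the self-critical loss it implies. This is not the authors' implementation: the function names (`f1_reward`, `mixed_loss`), the `log_prob_sampled` placeholder for the summed log-probability of a sampled answer span, and the `rl_weight` mixing coefficient are all illustrative assumptions, though the token-level F1 matches the standard SQuAD metric.

```python
# Minimal sketch (not the authors' code) of a word-overlap reward and a
# self-critical mixed objective of the kind described in the abstract.
from collections import Counter

def f1_reward(predicted_tokens, gold_tokens):
    """Token-level F1 between predicted and gold answer spans; this is
    the word-overlap reward that replaces brittle exact match."""
    common = Counter(predicted_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def mixed_loss(ce_loss, log_prob_sampled, sampled_tokens, greedy_tokens,
               gold_tokens, rl_weight=0.5):
    """Combine cross-entropy with a self-critical REINFORCE term.
    The greedy decode serves as the baseline, so a sampled answer is
    only rewarded to the extent it beats the model's own greedy answer."""
    advantage = (f1_reward(sampled_tokens, gold_tokens)
                 - f1_reward(greedy_tokens, gold_tokens))
    rl_loss = -advantage * log_prob_sampled  # policy gradient with baseline
    return (1 - rl_weight) * ce_loss + rl_weight * rl_loss
```

The baseline subtraction is what makes the estimator "self-critical": it reduces gradient variance without a learned critic, and it hands out zero advantage whenever sampling cannot improve on greedy decoding.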
