Distributionally Robust Language Modeling

Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance under standard maximum likelihood (MLE) training. To remedy this without knowledge of the test distribution, we propose an approach that trains a model to perform well over a wide range of potential test distributions. In particular, we derive a new distributionally robust optimization (DRO) procedure which minimizes the model's loss over the worst-case mixture of topics that has sufficient overlap with the training distribution. Our approach, called topic conditional value at risk (topic CVaR), obtains a 5.5-point perplexity reduction over MLE when language models are trained on a mixture of Yelp reviews and news and tested only on reviews.
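
To make the objective concrete, below is a minimal PyTorch sketch of a batch-level topic-CVaR loss under our reading of the abstract. The function name `topic_cvar_loss`, the greedy per-batch CVaR computation, and the `alpha` overlap parameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def topic_cvar_loss(per_example_loss: torch.Tensor,
                    topic_ids: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Illustrative sketch (not the paper's implementation) of a
    topic-CVaR objective: the loss under the worst-case mixture of
    topics whose weights stay within a factor 1/alpha of the
    empirical topic proportions. alpha = 1 recovers the average
    (MLE) loss on the batch; alpha -> 0 approaches the loss of the
    single worst topic.
    """
    # Mean loss and empirical proportion of each topic in the batch.
    topics = topic_ids.unique()
    mean_loss = torch.stack(
        [per_example_loss[topic_ids == t].mean() for t in topics])
    props = torch.stack(
        [(topic_ids == t).float().mean() for t in topics])

    # Greedy solution of max_q sum_k q_k * L_k subject to
    # 0 <= q_k <= p_k / alpha and sum_k q_k = 1: place as much
    # weight as allowed on the highest-loss topics first.
    order = torch.argsort(mean_loss, descending=True)
    budget = 1.0
    cvar = mean_loss.new_zeros(())
    for i in order:
        w = min(props[i].item() / alpha, budget)
        cvar = cvar + w * mean_loss[i]
        budget -= w
        if budget <= 0:
            break
    return cvar
```

In training, `per_example_loss` would be the per-sequence negative log-likelihood from the language model, and `topic_ids` could come from an LDA-style topic model run over the training corpus; both choices are assumptions for this sketch.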
