Structured Prediction via Learning to Search under Bandit Feedback

We present an algorithm for structured prediction under online bandit feedback. The learner repeatedly predicts a sequence of actions, generating a structured output. It then observes feedback for that output and no others. We consider two cases: a pure bandit setting in which it only observes a loss, and more fine-grained feedback in which it observes a loss for every action. We find that the fine-grained feedback is necessary for strong empirical performance, because it allows for a robust variance-reduction strategy. We empirically compare a number of different algorithms and exploration methods and show the efficacy of BLS on sequence labeling and dependency parsing tasks.

[1]  John Langford,et al.  A Credit Assignment Compiler for Joint Prediction , 2014, NIPS.

[2]  Alan Fern,et al.  Discriminative Learning of Beam-Search Heuristics for Planning , 2007, IJCAI.

[3]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[4]  Alan Fern,et al.  HC-Search: A Learning Framework for Search-based Structured Prediction , 2014, J. Artif. Intell. Res..

[5]  Santosh S. Vempala,et al.  Efficient algorithms for online decision problems , 2005, J. Comput. Syst. Sci..

[6]  Akshay Krishnamurthy,et al.  Efficient Contextual Semi-Bandit Learning , 2015, ArXiv.

[7]  John Langford,et al.  Doubly Robust Policy Evaluation and Optimization , 2014, ArXiv.

[8]  Andreas Krause,et al.  Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting , 2009, IEEE Transactions on Information Theory.

[9]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[10]  Shipra Agrawal,et al.  Thompson Sampling for Contextual Bandits with Linear Payoffs , 2012, ICML.

[11]  D. Horvitz,et al.  A Generalization of Sampling Without Replacement from a Finite Universe , 1952 .

[12]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[13]  Stefan Riezler,et al.  Bandit structured prediction for learning from partial feedback in statistical machine translation , 2016, MTSUMMIT.

[14]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[15]  John Langford,et al.  Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.

[16]  Joakim Nivre,et al.  Training Deterministic Parsers with Non-Deterministic Oracles , 2013, TACL.

[17]  Brendan T. O'Connor,et al.  Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters , 2013, NAACL.

[18]  Hiroshi Nakagawa,et al.  Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays , 2015, ICML.

[19]  John Langford,et al.  Efficient Optimal Learning for Contextual Bandits , 2011, UAI.

[20]  Brian Roark,et al.  Incremental Parsing with the Perceptron Algorithm , 2004, ACL.

[21]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[22]  John Langford,et al.  Learning to Search Better than Your Teacher , 2015, ICML.

[23]  Joakim Nivre,et al.  An Efficient Algorithm for Projective Dependency Parsing , 2003, IWPT.

[24]  Alexander M. Rush,et al.  Sequence-to-Sequence Learning as Beam-Search Optimization , 2016, EMNLP.

[25]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[26]  Dan Roth,et al.  Design Challenges and Misconceptions in Named Entity Recognition , 2009, CoNLL.

[27]  Nello Cristianini,et al.  Finite-Time Analysis of Kernelised Contextual Bandits , 2013, UAI.

[28]  H. Robbins Some aspects of the sequential design of experiments , 1952 .

[29]  J. Andrew Bagnell,et al.  Reinforcement and Imitation Learning via Interactive No-Regret Learning , 2014, ArXiv.

[30]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[31]  Veselin Stoyanov,et al.  Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure , 2011, AISTATS.

[32]  Percy Liang,et al.  Learning Where to Sample in Structured Prediction , 2015, AISTATS.

[33]  Lihong Li,et al.  An Empirical Evaluation of Thompson Sampling , 2011, NIPS.

[34]  John Langford,et al.  The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information , 2007, NIPS.

[35]  Stefan Riezler,et al.  Learning Structured Predictors from Bandit Feedback for Interactive NLP , 2016, ACL.

[36]  John Langford,et al.  Search-based structured prediction , 2009, Machine Learning.

[37]  Geoffrey J. Gordon,et al.  A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.

[38]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[39]  Daniel Marcu,et al.  Learning as search optimization: approximate large margin methods for structured prediction , 2005, ICML.