Structured Prediction via Learning to Search under Bandit Feedback

We present an algorithm for structured prediction under online bandit feedback. The learner repeatedly predicts a sequence of actions, generating a structured output. It then observes feedback for that output and no others. We consider two cases: a pure bandit setting in which it observes only a single loss, and a more fine-grained setting in which it observes a loss for every action. We find that fine-grained feedback is necessary for strong empirical performance, because it allows for a robust variance-reduction strategy. We empirically compare a number of different algorithms and exploration methods and show the efficacy of the proposed algorithm, BLS, on sequence labeling and dependency parsing tasks.
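The two feedback cases described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: Hamming loss for sequence labeling, epsilon-greedy exploration, the inverse-propensity cost estimate, and all function names are assumptions chosen only to make the setting concrete.

```python
import random

def hamming_losses(predicted, gold):
    """Per-action losses: 1.0 where the predicted label differs from gold."""
    return [0.0 if p == g else 1.0 for p, g in zip(predicted, gold)]

def pure_bandit_feedback(predicted, gold):
    """Pure bandit case (assumed Hamming loss): one loss for the whole output."""
    return sum(hamming_losses(predicted, gold))

def fine_grained_feedback(predicted, gold):
    """Fine-grained case: one loss per action, for the predicted output only."""
    return hamming_losses(predicted, gold)

def epsilon_greedy(scores, epsilon, rng):
    """Choose an action index from predicted costs; return it with its probability."""
    k = len(scores)
    greedy = min(range(k), key=lambda a: scores[a])  # lowest predicted cost
    action = rng.randrange(k) if rng.random() < epsilon else greedy
    # Uniform exploration can also land on the greedy action, hence epsilon / k.
    prob = epsilon / k + (1.0 - epsilon) * (1.0 if action == greedy else 0.0)
    return action, prob

def ips_cost_vector(num_actions, action, prob, observed_loss):
    """Inverse-propensity estimate: only the chosen action gets a nonzero cost."""
    costs = [0.0] * num_actions
    costs[action] = observed_loss / prob
    return costs

if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["N", "V", "N"]        # hypothetical gold labels, hidden from the learner
    predicted = ["N", "N", "N"]   # hypothetical structured output
    print(pure_bandit_feedback(predicted, gold))
    print(fine_grained_feedback(predicted, gold))
    action, prob = epsilon_greedy([0.5, 0.1, 0.9], epsilon=0.2, rng=rng)
    print(ips_cost_vector(3, action, prob, observed_loss=1.0))
```

In the pure bandit case the learner sees only the summed loss, so it cannot tell which action caused the error. The per-action losses localize it, which is the kind of extra signal the abstract credits for making variance reduction possible.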

Hal Daumé | Amr Sharaf
