[1] D. Rubin, et al. The central role of the propensity score in observational studies for causal effects, 1983.
[2] Lihong Li, et al. An Empirical Evaluation of Thompson Sampling, 2011, NIPS.
[3] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[4] Hal Daumé, et al. Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback, 2017, EMNLP.
[5] Stefan Riezler, et al. Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation, 2017, EMNLP.
[6] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
[7] Philip S. Thomas, et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.
[8] Stefan Riezler, et al. Stochastic Structured Prediction under Bandit Feedback, 2016, NIPS.
[9] Stefan Riezler, et al. Bandit Structured Prediction for Neural Sequence-to-Sequence Learning, 2017, ACL.
[10] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.
[11] Thorsten Joachims, et al. The Self-Normalized Estimator for Counterfactual Learning, 2015, NIPS.
[12] Joaquin Quiñonero Candela, et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.
[13] Khashayar Khosravi, et al. Exploiting the Natural Exploration In Contextual Bandits, 2017, arXiv.