Online Markov Decision Processes with Aggregate Bandit Feedback
Haim Kaplan | Yishay Mansour | Alon Cohen | Tomer Koren