Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
暂无分享,去创建一个
[1] Gergely Neu,et al. Online learning in episodic Markovian decision processes by relative entropy policy search , 2013, NIPS.
[2] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .
[3] Gábor Lugosi,et al. Prediction, learning, and games , 2006 .
[4] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..
[5] Shie Mannor,et al. Markov Decision Processes with Arbitrary Reward Processes , 2008, Math. Oper. Res..
[6] Yishay Mansour,et al. Online Markov Decision Processes , 2009, Math. Oper. Res..
[7] Ambuj Tewari,et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.
[8] András György,et al. The adversarial stochastic shortest path problem with unknown transition probabilities , 2012, AISTATS.
[9] Yishay Mansour,et al. Online Convex Optimization in Adversarial Markov Decision Processes , 2019, ICML.
[10] Emma Brunskill,et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , 2019, ICML.
[11] Csaba Szepesvári,et al. Online Markov Decision Processes Under Bandit Feedback , 2010, IEEE Transactions on Automatic Control.
[12] Michael I. Jordan,et al. Is Q-learning Provably Efficient? , 2018, NeurIPS.
[13] Rémi Munos,et al. Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.