Paper information - A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs


An Unobservable MDP (UMDP) is a POMDP in which there are no observations. An Only-Costly-Observable MDP (OCOMDP) is a POMDP that extends a UMDP by allowing a particular costly action that completely observes the state. We introduce UR-MAX, a reinforcement learning algorithm with polynomial interaction complexity for unknown OCOMDPs.
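To make the OCOMDP definition concrete, the following is a minimal illustrative sketch of such an environment, not the UR-MAX algorithm and not code from the paper. The class name `ToyOCOMDP`, the action label `"observe"`, the `observe_cost` parameter, and the choice that observing leaves the state unchanged are all assumptions made for this example.

```python
import random


class ToyOCOMDP:
    """Hypothetical toy OCOMDP: the state stays hidden unless the agent pays to observe it."""

    OBSERVE = "observe"  # assumed label for the single costly observing action

    def __init__(self, transitions, rewards, observe_cost, start_state, seed=0):
        # transitions[state][action] -> list of (next_state, probability)
        # rewards[state][action] -> immediate reward (float)
        self.transitions = transitions
        self.rewards = rewards
        self.observe_cost = observe_cost
        self._state = start_state          # hidden from the agent
        self._rng = random.Random(seed)

    def step(self, action):
        """Return (reward, observation); observation is None for every ordinary action."""
        if action == self.OBSERVE:
            # Assumption: observing reveals the state exactly, costs observe_cost,
            # and does not move the state.
            return -self.observe_cost, self._state
        reward = self.rewards[self._state][action]
        next_states, probs = zip(*self.transitions[self._state][action])
        self._state = self._rng.choices(next_states, weights=probs)[0]
        return reward, None  # as in a UMDP, no observation is returned


# Two states that alternate deterministically under the single ordinary action "a".
transitions = {0: {"a": [(1, 1.0)]}, 1: {"a": [(0, 1.0)]}}
rewards = {0: {"a": 0.0}, 1: {"a": 1.0}}
env = ToyOCOMDP(transitions, rewards, observe_cost=0.5, start_state=0)
```

With the observing action removed, this environment would behave as a UMDP: every call to `step` would return `None` as the observation.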

Moshe Tennenholtz | Roy Fox

[1] John N. Tsitsiklis, et al. The Complexity of Markov Decision Processes, 1987, Math. Oper. Res.

[2] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.

[3] Yishay Mansour, et al. Reinforcement Learning in POMDPs Without Resets, 2005, IJCAI.

[4] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[5] Samuel W. Hasinoff, et al. Reinforcement Learning for Problems with Hidden State, 2003.

[6] Edward J. Sondik, et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon, 1973, Oper. Res.

[7] Shlomo Zilberstein, et al. Reinforcement Learning for Mixed Open-loop and Closed-loop Control, 1996, NIPS.