Learning and Planning in Average-Reward Markov Decision Processes
Richard S. Sutton | Abhishek Naik | Yi Wan
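For context, the paper indexed here concerns learning and planning algorithms for the average-reward (undiscounted) formulation of MDPs, in the line of work cited below (e.g., [3], [28], [31]). The following is a minimal illustrative sketch, not taken from the paper itself, of a tabular average-reward ("differential") Q-learning update in Python; the function name, step sizes, and data structures are assumptions made for the example.

    import numpy as np

    def differential_q_step(Q, avg_reward, s, a, r, s_next, alpha=0.1, eta=1.0):
        """One tabular average-reward Q-learning update (illustrative sketch).

        Q          : 2-D array of differential action-value estimates, indexed [state, action]
        avg_reward : current scalar estimate of the long-run average reward
        alpha      : step size for the value estimates; eta scales the average-reward step size
        Returns the updated average-reward estimate; Q is updated in place.
        """
        # The TD error subtracts the average-reward estimate from the reward
        # in place of discounting future values.
        delta = r - avg_reward + np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * delta
        avg_reward += eta * alpha * delta
        return avg_reward

Under the usual step-size conditions, updates of this form estimate the differential values and the average reward simultaneously, which is the setting treated by several of the works below. The references indexed for the paper follow.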
[1] Vivek S. Borkar, et al. Learning Algorithms for Markov Decision Processes with Average Cost, 2001, SIAM J. Control. Optim.
[2] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.
[3] Satinder P. Singh, et al. Reinforcement Learning Algorithms for Average-Payoff Markovian Decision Processes, 1994, AAAI.
[4] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.
[5] D. White, et al. Dynamic programming, Markov chains, and the method of successive approximations, 1963.
[6] Nevena Lazic, et al. Exploration-Enhanced POLITEX, 2019, ArXiv.
[7] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Autom.
[8] Dimitri P. Bertsekas, et al. Convergence Results for Some Temporal Difference Methods Based on Least Squares, 2009, IEEE Transactions on Automatic Control.
[9] Bo Dai, et al. GenDICE: Generalized Offline Estimation of Stationary Values, 2020, ICLR.
[10] John N. Tsitsiklis, et al. Average cost temporal-difference learning, 1997, Proceedings of the 36th IEEE Conference on Decision and Control.
[11] Richard Wheeler, et al. Decentralized learning in finite Markov chains, 1985, 24th IEEE Conference on Decision and Control.
[12] Abhijit Gosavi, et al. Reinforcement learning for long-run average cost, 2004, Eur. J. Oper. Res.
[13] S. Mahadevan, et al. Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning, 1999.
[14] Bo Dai, et al. Batch Stationary Distribution Estimation, 2020, ICML.
[15] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.
[16] John N. Tsitsiklis, et al. Simulation-based optimization of Markov reward processes, 2001, IEEE Trans. Autom. Control.
[17] S. Whiteson, et al. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values, 2020, ICML.
[18] Yang Gao, et al. Efficient Average Reward Reinforcement Learning Using Constant Shifting Values, 2016, AAAI.
[19] Peter L. Bartlett, et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction, 2019, ICML.
[20] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[21] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.
[22] P. Schrimpf, et al. Dynamic Programming, 2011.
[23] Paul J. Schweitzer, et al. The Functional Equations of Undiscounted Markov Renewal Programming, 1971, Math. Oper. Res.
[24] Peter Auer, et al. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning, 2006, NIPS.
[25] Ronald A. Howard, et al. Dynamic Programming and Markov Processes, 1960.
[26] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[27] Sean P. Meyn, et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, 2000, SIAM J. Control. Optim.
[28] Anton Schwartz, et al. A Reinforcement Learning Method for Maximizing Undiscounted Rewards, 1993, ICML.
[29] Qiang Liu, et al. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning, 2020, ICLR.
[30] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.
[31] Sridhar Mahadevan, et al. Average reward reinforcement learning: Foundations, algorithms, and empirical results, 2004, Machine Learning.
[32] Vijay R. Konda, et al. Actor-Critic Algorithms, 1999, NIPS.
[33] A. Jalali, et al. A distributed asynchronous algorithm for expected average cost dynamic programming, 1990, 29th IEEE Conference on Decision and Control.
[34] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.
[35] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.
[36] Qiang Liu, et al. Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation, 2019, ICLR.
[37] M. Gallagher, et al. Average-reward model-free reinforcement learning: a systematic review and literature mapping, 2020, ArXiv.
[38] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
[39] Zhiyuan Ren, et al. Adaptive control of Markov chains with average cost, 2001, IEEE Trans. Autom. Control.
[40] A. Jalali, et al. Computationally efficient adaptive control algorithms for Markov chains, 1989, Proceedings of the 28th IEEE Conference on Decision and Control.
[41] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008.
[42] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.
[43] V. Borkar. Asynchronous Stochastic Approximations, 1998.
[44] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.
[45] Martha White, et al. Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return, 2018, UAI.
[46] A. Naik, et al. Discounted Reinforcement Learning is Not an Optimization Problem, 2019, ArXiv.
[47] M. Dahleh. Laboratory for Information and Decision Systems, 2005.