Efficient Bayesian Nonparametric Methods for Model-Free Reinforcement Learning in Centralized and Decentralized Sequential Environments

by Miao Liu, Department of Electrical and Computer Engineering, Duke University

[1]  Anne Condon,et al.  On the Undecidability of Probabilistic Planning and Infinite-Horizon Partially Observable Markov Decision Problems , 1999, AAAI/IAAI.

[2]  Edward J. Sondik,et al.  The Optimal Control of Partially Observable Markov Processes over a Finite Horizon , 1973, Oper. Res..

[3]  Christian P. Robert,et al.  Monte Carlo Statistical Methods , 2005, Springer Texts in Statistics.

[4]  Doina Precup,et al.  Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..

[5]  Michail G. Lagoudakis,et al.  Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..

[6]  Eric Moulines,et al.  On‐line expectation–maximization algorithm for latent data models , 2007, ArXiv.

[7]  Nicholas R. J. Lawrance,et al.  Gaussian processes for informative exploration in reinforcement learning , 2013, 2013 IEEE International Conference on Robotics and Automation.

[8]  Peter Dayan,et al.  Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search , 2013, J. Artif. Intell. Res..

[9]  Kee-Eung Kim,et al.  Solving POMDPs by Searching the Space of Finite Policies , 1999, UAI.

[10]  Yoram Singer,et al.  The Hierarchical Hidden Markov Model: Analysis and Applications , 1998, Machine Learning.

[11]  Matthijs T. J. Spaan,et al.  Multi-robot planning under uncertainty with communication: a case study , 2010 .

[12]  Edward J. Sondik,et al.  The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs , 1978, Oper. Res..

[13]  Hui Li,et al.  Multi-task Reinforcement Learning in Partially Observable Stochastic Environments , 2009, J. Mach. Learn. Res..

[14]  W. Haddad,et al.  Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach , 2008 .

[15]  Shlomo Zilberstein,et al.  Increasing scalability in algorithms for centralized and decentralized partially observable Markov decision processes: efficient decision-making and coordination in uncertain environments , 2010 .

[16]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[17]  Marc Peter Deisenroth,et al.  Efficient reinforcement learning using Gaussian processes , 2010 .

[18]  Carlos Guestrin,et al.  Multiagent Planning with Factored MDPs , 2001, NIPS.

[19]  Eric Wiewiora,et al.  Learning predictive representations from a history , 2005, ICML.

[20]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[21]  Joelle Pineau,et al.  Anytime Point-Based Approximations for Large POMDPs , 2006, J. Artif. Intell. Res..

[22]  Steve J. Young,et al.  Using POMDPs for dialog management , 2006, 2006 IEEE Spoken Language Technology Workshop.

[23]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[24]  Lawrence Carin,et al.  Learning to Explore and Exploit in POMDPs , 2009, NIPS.

[25]  Gavin Taylor,et al.  Kernelized value function approximation for reinforcement learning , 2009, ICML '09.

[26]  L. Carin,et al.  Transfer Learning for Reinforcement Learning with Dependent Dirichlet Process and Gaussian Process , 2012 .

[27]  Chong Wang,et al.  Variational inference in nonconjugate models , 2012, J. Mach. Learn. Res..

[28]  Leslie Pack Kaelbling,et al.  Bayesian Policy Search with Policy Priors , 2011, IJCAI.

[29]  U. Rieder,et al.  Markov Decision Processes , 2010 .

[30]  Lancelot F. James,et al.  Gibbs Sampling Methods for Stick-Breaking Priors , 2001 .

[31]  Brahim Chaib-draa,et al.  Predictive representations for policy gradient in POMDPs , 2009, ICML '09.

[32]  Jonathan P. How,et al.  Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture , 2013, NIPS.

[33]  P. Olver Nonlinear Systems , 2013 .

[34]  T. Ferguson A Bayesian Analysis of Some Nonparametric Problems , 1973 .

[35]  Kee-Eung Kim,et al.  Learning Finite-State Controllers for Partially Observable Environments , 1999, UAI.

[36]  Chong Wang,et al.  Online Variational Inference for the Hierarchical Dirichlet Process , 2011, AISTATS.

[37]  Douglas Aberdeen,et al.  Scalable Internal-State Policy-Gradient Methods for POMDPs , 2002, ICML.

[38]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[39]  Andrew W. Moore,et al.  Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.

[40]  Marc Toussaint,et al.  Model-free reinforcement learning as mixture learning , 2009, ICML '09.

[41]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm (with discussion) , 1977 .

[42]  Pierre Geurts,et al.  Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[43]  Neil Immerman,et al.  The Complexity of Decentralized Control of Markov Decision Processes , 2000, UAI.

[44]  Reid G. Simmons,et al.  Heuristic Search Value Iteration for POMDPs , 2004, UAI.

[45]  Nikos A. Vlassis,et al.  The Cross-Entropy Method for Policy Search in Decentralized POMDPs , 2008, Informatica.

[46]  Joelle Pineau,et al.  Towards robotic assistants in nursing homes: Challenges and results , 2003, Robotics Auton. Syst..

[47]  Feng Wu,et al.  Monte-Carlo Expectation Maximization for Decentralized POMDPs , 2013, IJCAI.

[48]  Carl E. Rasmussen,et al.  Sparse Spectrum Gaussian Process Regression , 2010, J. Mach. Learn. Res..

[49]  Ronald E. Parr,et al.  Hierarchical control and learning for Markov decision processes , 1998 .

[50]  Chong Wang,et al.  Stochastic variational inference , 2012, J. Mach. Learn. Res..

[51]  Peter Szabó,et al.  Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods , 2005, NIPS.

[52]  Peter Stone,et al.  Transfer Learning for Reinforcement Learning Domains: A Survey , 2009, J. Mach. Learn. Res..

[53]  D. Blackwell Discounted Dynamic Programming , 1965 .

[54]  D. J. White,et al.  A Survey of Applications of Markov Decision Processes , 1993 .

[55]  Shalabh Bhatnagar,et al.  Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.

[56]  Girish Chowdhary,et al.  Off-policy reinforcement learning with Gaussian processes , 2014, IEEE/CAA Journal of Automatica Sinica.

[57]  Marc Toussaint,et al.  Hierarchical POMDP Controller Optimization by Likelihood Maximization , 2008, UAI.

[58]  Andrew Y. Ng,et al.  Regularization and feature selection in least-squares temporal difference learning , 2009, ICML '09.

[59]  Theodore J. Perkins,et al.  Reinforcement learning for POMDPs based on action values and stochastic optimization , 2002, AAAI/IAAI.

[60]  Pascal Poupart,et al.  Model-based Bayesian Reinforcement Learning in Partially Observable Domains , 2008, ISAIM.

[61]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[62]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[63]  Joshua B. Tenenbaum,et al.  Nonparametric Bayesian Policy Priors for Reinforcement Learning , 2010, NIPS.

[64]  Shalabh Bhatnagar,et al.  Toward Off-Policy Learning Control with Function Approximation , 2010, ICML.

[65]  Jesse Hoey,et al.  An analytic solution to discrete Bayesian reinforcement learning , 2006, ICML.

[66]  Guy Shani,et al.  Model-Based Online Learning of POMDPs , 2005, ECML.

[67]  Shie Mannor,et al.  Reinforcement learning with Gaussian processes , 2005, ICML.

[68]  Leemon C. Baird,et al.  Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[69]  Shlomo Zilberstein,et al.  Improved Memory-Bounded Dynamic Programming for Decentralized POMDPs , 2007, UAI.

[70]  Joelle Pineau,et al.  Point-based value iteration: An anytime algorithm for POMDPs , 2003, IJCAI.

[71]  Sebastian Thrun,et al.  Efficient Exploration In Reinforcement Learning , 1992 .

[72]  John N. Tsitsiklis,et al.  Analysis of temporal-difference learning with function approximation , 1996, NIPS.

[73]  Ke Jiang,et al.  Small-Variance Asymptotics for Hidden Markov Models , 2013, NIPS.

[74]  Nikos A. Vlassis,et al.  Perseus: Randomized Point-based Value Iteration for POMDPs , 2005, J. Artif. Intell. Res..

[75]  Richard S. Sutton,et al.  Predictive Representations of State , 2001, NIPS.

[76]  Craig Boutilier,et al.  Bounded Finite State Controllers , 2003, NIPS.

[77]  Bikramjit Banerjee,et al.  Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs , 2012, AAAI.

[78]  Matthew J. Johnson,et al.  Bayesian nonparametric hidden semi-Markov models , 2012, J. Mach. Learn. Res..

[79]  Carl E. Rasmussen,et al.  Gaussian Processes in Reinforcement Learning , 2003, NIPS.

[80]  Bo Liu,et al.  Regularized Off-Policy TD-Learning , 2012, NIPS.

[81]  Shlomo Zilberstein,et al.  Planetary Rover Control as a Markov Decision Process , 2002 .

[82]  Sebastiaan A. Terwijn,et al.  On the Learnability of Hidden Markov Models , 2002, ICGI.

[83]  Lihong Li,et al.  A Bayesian Sampling Approach to Exploration in Reinforcement Learning , 2009, UAI.

[84]  Shlomo Zilberstein,et al.  Policy Iteration for Decentralized Control of Markov Decision Processes , 2009, J. Artif. Intell. Res..

[85]  Dan Lizotte,et al.  Convergent Fitted Value Iteration with Linear Function Approximation , 2011, NIPS.

[86]  Jason Pazis,et al.  PAC Optimal Exploration in Continuous Space Markov Decision Processes , 2013, AAAI.

[87]  Lehel Csató,et al.  Sparse On-Line Gaussian Processes , 2002, Neural Computation.

[88]  Michael L. Littman,et al.  A theoretical analysis of Model-Based Interval Estimation , 2005, ICML.

[89]  Shlomo Zilberstein,et al.  Dynamic Programming for Partially Observable Stochastic Games , 2004, AAAI.

[90]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[91]  Leslie Pack Kaelbling,et al.  Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[92]  Hui Li,et al.  Region-based value iteration for partially observable Markov decision processes , 2006, ICML.

[93]  Matthew J. Beal Variational algorithms for approximate Bayesian inference , 2003 .

[94]  Padhraic Smyth,et al.  Learning concept graphs from text with stick-breaking priors , 2010, NIPS.

[95]  Bart De Schutter,et al.  Reinforcement Learning and Dynamic Programming Using Function Approximators , 2010 .

[96]  Michael L. Littman,et al.  A unifying framework for computational reinforcement learning theory , 2009 .

[97]  Jaakko Peltonen,et al.  Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning , 2011, NIPS.

[98]  Lydia E. Kavraki,et al.  Automated model approximation for robotic navigation with POMDPs , 2013, 2013 IEEE International Conference on Robotics and Automation.

[99]  G. Roberts,et al.  Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models , 2007, arXiv:0710.4228.

[100]  Shlomo Zilberstein,et al.  Anytime Planning for Decentralized POMDPs using Expectation Maximization , 2010, UAI.

[101]  Victor R. Lesser,et al.  Coordinated Multi-Agent Reinforcement Learning in Networked Distributed POMDPs , 2011, AAAI.

[102]  Nasser M. Nasrabadi,et al.  Pattern Recognition and Machine Learning , 2006, Technometrics.

[103]  John N. Tsitsiklis,et al.  The Complexity of Markov Decision Processes , 1987, Math. Oper. Res..

[104]  Siu-Yeung Cho,et al.  A Modified Memory-Based Reinforcement Learning Method for Solving POMDP Problems , 2011, Neural Processing Letters.

[105]  Shie Mannor,et al.  The kernel recursive least-squares algorithm , 2004, IEEE Transactions on Signal Processing.

[106]  É. Moulines,et al.  Convergence of a stochastic approximation version of the EM algorithm , 1999 .

[107]  Yee Whye Teh,et al.  Infinite Hierarchical Hidden Markov Models , 2009, AISTATS.

[108]  Byron Boots,et al.  Spectral Approaches to Learning Predictive Representations , 2011 .

[109]  Nicholas R. Jennings,et al.  Decentralized Bayesian reinforcement learning for online agent collaboration , 2012, AAMAS.

[110]  Jonathan P. How,et al.  Decentralized control of partially observable Markov decision processes , 2013, 52nd IEEE Conference on Decision and Control.

[111]  Frans A. Oliehoek,et al.  Decentralized POMDPs , 2012, Reinforcement Learning.

[112]  Joel Veness,et al.  Monte-Carlo Planning in Large POMDPs , 2010, NIPS.

[113]  Michael I. Jordan,et al.  Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models , 2012, NIPS.

[114]  Michael I. Jordan,et al.  MAD-Bayes: MAP-based Asymptotic Derivations from Bayes , 2012, ICML.

[115]  Sebastian Thrun,et al.  Is Learning The n-th Thing Any Easier Than Learning The First? , 1995, NIPS.

[116]  Marco Wiering,et al.  Utile distinction hidden Markov models , 2004, ICML.

[117]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[118]  Stuart J. Russell,et al.  Reinforcement Learning with Hierarchies of Machines , 1997, NIPS.

[119]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[120]  Hui Li,et al.  Point-Based Policy Iteration , 2007, AAAI.

[121]  John Langford,et al.  Exploration in Metric State Spaces , 2003, ICML.

[122]  Jonathan P. How,et al.  Planning for decentralized control of multiple robots under uncertainty , 2014, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[123]  Andrew McCallum,et al.  Reinforcement learning with selective perception and hidden state , 1996 .

[124]  Charles L. Isbell,et al.  Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs , 2013, NIPS.

[125]  Bikramjit Banerjee,et al.  Pruning for Monte Carlo Distributed Reinforcement Learning in Decentralized POMDPs , 2013, AAAI.

[126]  Michael I. Jordan,et al.  Revisiting k-means: New Algorithms via Bayesian Nonparametrics , 2011, ICML.

[127]  Lawrence Carin,et al.  Online Expectation Maximization for Reinforcement Learning in POMDPs , 2013, IJCAI.

[129]  Thomas J. Walsh,et al.  Exploring compact reinforcement-learning representations with linear regression , 2009, UAI.

[130]  Frans A. Oliehoek,et al.  Value-Based Planning for Teams of Agents in Stochastic Partially Observable Environments , 2010 .

[131]  Lawrence Carin,et al.  Hidden Markov Models With Stick-Breaking Priors , 2009, IEEE Transactions on Signal Processing.

[132]  Tzu-Tsung Wong,et al.  Generalized Dirichlet distribution in Bayesian analysis , 1998, Appl. Math. Comput..

[133]  Peter Stone,et al.  Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration , 2010, ECML/PKDD.

[134]  Sean P. Meyn,et al.  An analysis of reinforcement learning with function approximation , 2008, ICML '08.

[135]  Finale Doshi-Velez,et al.  The Infinite Partially Observable Markov Decision Process , 2009, NIPS.

[136]  Andrew Y. Ng,et al.  Near-Bayesian exploration in polynomial time , 2009, ICML '09.

[137]  François Charpillet,et al.  MAA*: A Heuristic Search Algorithm for Solving Decentralized POMDPs , 2005, UAI.

[138]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[139]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[140]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[141]  Makoto Yokoo,et al.  Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs , 2005, IJCAI.

[142]  David B. Dunson,et al.  Approximate Dynamic Programming for Storage Problems , 2011, ICML.

[143]  A. Cassandra A Survey of POMDP Applications , 2003 .

[144]  Lawrence Carin,et al.  The Infinite Regionalized Policy Representation , 2011, ICML.

[145]  H. Kushner,et al.  Stochastic Approximation and Recursive Algorithms and Applications , 2003 .

[146]  David Hsu,et al.  DESPOT: Online POMDP Planning with Regularization , 2013, NIPS.

[147]  Feng Qi (祁锋) Bounds for the Ratio of Two Gamma Functions , 2009 .

[148]  Shlomo Zilberstein,et al.  Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs , 2010, Autonomous Agents and Multi-Agent Systems.

[149]  Eric A. Hansen,et al.  Solving POMDPs by Searching in Policy Space , 1998, UAI.

[150]  Leslie Pack Kaelbling,et al.  Spatial and Temporal Abstractions in POMDPs Applied to Robot Navigation , 2005 .

[151]  Joelle Pineau,et al.  A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes , 2011, J. Mach. Learn. Res..

[152]  T. Başar,et al.  A New Approach to Linear Filtering and Prediction Problems , 2001 .

[153]  Andre Wibisono,et al.  Streaming Variational Bayes , 2013, NIPS.

[154]  Zoubin Ghahramani,et al.  Sparse Gaussian Processes using Pseudo-inputs , 2005, NIPS.

[155]  Leslie Pack Kaelbling,et al.  Planning with macro-actions in decentralized POMDPs , 2014, AAMAS.