Bayesian nonparametric approaches for reinforcement learning in partially observable domains

Making intelligent decisions from incomplete information is critical in many applications: for example, medical decisions must often be made based on a few vital signs, without full knowledge of a patient's condition, and speech-based interfaces must infer a user's needs from noisy microphone inputs. What makes these tasks hard is that we often do not even have a natural representation with which to model the task; we must learn about the task's properties while simultaneously performing the task. Learning a representation for a task also involves a trade-off between modeling the data that we have already seen and being able to make predictions about new data streams. In this thesis, we explore one approach for learning representations of stochastic systems using Bayesian nonparametric statistics. Bayesian nonparametric methods allow the sophistication of a representation to scale gracefully with the complexity of the data. We show how representations learned using Bayesian nonparametric methods result in better performance and interesting learned structure in three contexts related to reinforcement learning in partially observable domains: learning partially observable Markov decision processes, taking advantage of expert demonstrations, and learning complex hidden structures such as dynamic Bayesian networks. In each of these contexts, Bayesian nonparametric approaches provide advantages in prediction quality and, often, in computation time.
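To make the scaling property concrete, here is a minimal sketch (not code from the thesis) of the Chinese restaurant process view of a Dirichlet process prior: the number of occupied "tables" (latent states or components) grows slowly with the amount of data, so the representation's capacity adapts to the data rather than being fixed in advance. The function name, the concentration value, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def sample_crp_table_counts(n_customers, alpha, rng=None):
    """Sample seatings from a Chinese restaurant process with concentration
    alpha; the number of occupied tables (latent states) grows roughly as
    alpha * log(n), so model capacity scales gracefully with the data."""
    rng = np.random.default_rng(rng)
    counts = []  # number of customers seated at each existing table
    for _ in range(n_customers):
        # Probability of each existing table is proportional to its count;
        # a new table is opened with probability proportional to alpha.
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)      # open a new table (new latent state)
        else:
            counts[table] += 1
    return counts

# More observations lead to more, but only sub-linearly more, latent states.
for n in (10, 100, 1000):
    print(n, len(sample_crp_table_counts(n, alpha=1.0, rng=0)))
```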
