Prediction-Constrained POMDPs

We propose prediction-constrained (PC) training for POMDPs, which yields high-reward policies while still explaining observed histories well. PC training enables effective model learning even when models are misspecified, because it can ignore observation patterns that would distract classic two-stage training.

1 Motivation and Background

The partially observable Markov decision process (POMDP) [Monahan, 1982, Kaelbling et al., 1998] is a popular framework for learning to act in partially observable domains. When the parameters of the POMDP are unknown (as is typical in reinforcement learning settings), a standard approach to identifying the optimal policy involves two stages: first, we fit transition and observation models to the data, and then we solve the learned POMDP to obtain a policy [Chrisman, 1992]. However, if not all of the signal in the observations is relevant for decision-making, this two-stage process can result in the first stage wasting modeling effort and the second stage learning inferior policies. We propose a novel POMDP training objective that balances two goals: providing accurate explanations of the data through a generative model (the POMDP), and learning a high-value policy. This two-term objective ensures that we do not waste computation developing an accurate model for parts of the problem that are irrelevant to decision making. Because our method is model-based, it will tend to be more sample efficient than alternative model-free deep learning methods, e.g. Hausknecht and Stone [2015]. This is a particular advantage in domains with limited data availability, such as healthcare.

POMDP Background and Notation. We consider POMDPs with K discrete states, A discrete actions, and continuous D-dimensional observations. Let $\tau_{ajk} \equiv p(s' = k \mid s = j, a)$ denote the probability of transitioning from state j to state k after taking action a, with $\sum_k \tau_{ajk} = 1$. For each dimension $d \in \{1, 2, \ldots, D\}$, we independently sample an observation $o_d \sim \mathcal{N}(\mu_{kad}, \sigma_{kad})$, where k identifies the state just entered (s') and a is the action just taken. Let θ = {τ, μ, σ} denote the collection of model parameters. These parameters define an input-output hidden Markov model (IO-HMM) [Bengio and Frasconi, 1995], for which the likelihood $p(o \mid a, \theta)$ of observations given actions can be evaluated via dynamic programming. Given θ and a learned reward function r(s, a) (we do not include r in the likelihood), we can solve the POMDP for the optimal policy. We use point-based value iteration (PBVI) [Pineau et al., 2003, Shani et al., 2013], an efficient solver for the small and medium-sized POMDPs we are interested in. We adapt ideas from Hoey and Poupart [2005] to handle continuous observations.

2 Proposed Method: Prediction-Constrained POMDP

Unlike existing two-stage methods [Chrisman, 1992, Koenig and Simmons, 1998], which learn θ by maximizing an IO-HMM likelihood alone, our new training objective learns θ by maximizing both the likelihood and an estimated value of the policy π(θ) given by PBVI:

$$
\max_{\theta} \;\; \frac{1}{D \left(\sum_n T_n\right)} \sum_{n \in \mathcal{D}_{\mathrm{expl}}} \log p\!\left(o_{n,1:T_n} \mid a_{n,1:T_n-1}, \theta\right) \;+\; \lambda \cdot \mathrm{value}\!\left(\pi(\theta), \pi_{\mathrm{beh}}, \mathcal{D}_{\mathrm{beh}}, r, \gamma\right). \tag{1}
$$

The first term is the IO-HMM data likelihood, while the second term is an off-policy estimate of the value of the optimal policy under the model parameters θ (see details below).
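To make the first term of Eq. (1) concrete, the following minimal sketch evaluates the IO-HMM log-likelihood of a single sequence with the standard forward (dynamic-programming) recursion, using the diagonal-Gaussian emissions defined above. It is only an illustration under our notation: the function name, the uniform initial state distribution, and the simplification that every observation obs[t] follows action actions[t] are our own assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def io_hmm_log_likelihood(obs, actions, tau, mu, sigma, init=None):
    """Log-likelihood log p(o_{1:T} | a, theta) of one sequence.

    obs     : (T, D) continuous observations
    actions : (T,)   integer actions; actions[t] is the action taken
              just before observing obs[t] (a simplification of Eq. (1))
    tau     : (A, K, K) transition probabilities tau[a, j, k]
    mu, sigma : (K, A, D) Gaussian emission means / standard deviations
    init    : (K,) initial state distribution (uniform if None)
    """
    T, D = obs.shape
    K = tau.shape[1]
    belief = np.full(K, 1.0 / K) if init is None else init
    log_lik = 0.0
    for t in range(T):
        a = actions[t]
        # Predict: push the belief through the action-conditioned transitions.
        pred = belief @ tau[a]                                 # shape (K,)
        # Emission log-density of obs[t] in each state (independent dims).
        log_emit = norm.logpdf(obs[t][None, :], mu[:, a, :], sigma[:, a, :]).sum(axis=1)
        # Log-sum-exp shift for numerical stability.
        shift = log_emit.max()
        joint = pred * np.exp(log_emit - shift)                # shape (K,)
        # Accumulate log p(o_t | o_{1:t-1}, a_{1:t}) and renormalize (filtering).
        log_lik += np.log(joint.sum()) + shift
        belief = joint / joint.sum()
    return log_lik
```

In the objective of Eq. (1), this per-sequence quantity would be summed over all sequences in D_expl and normalized by D(Σ_n T_n).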
[Figure 1: Tiger results as the number of distraction dimensions grows (x-axis: 1, 2, 4, 6, 8 dimensions), with K = 2, σ = 0.2. Left panel: average HMM marginal likelihood per dimension on the test set. Right panel: average return of the PBVI policy on the test set. Methods compared: oracle, two-stage (EM), two-stage (EM+), and PC training with λ ∈ {0.01, 0.1, 1, 10, 100}. PC training with λ > 0 outperforms two-stage EM+PBVI in policy value while still attaining reasonable likelihoods. "Oracle" is an ideal θ that models dimension 1 well enough that π(θ) finds the safe door. EM+ is a heuristic improvement to two-stage training in which dimensions correlated with rewards are constrained to have lower σ than other dimensions.]

Conceptually, Eq. (1) trades off the generative and reward-seeking properties of the model parameters. The tradeoff scalar λ > 0 controls how important the reward-seeking term is. We term our approach prediction-constrained (PC) training for POMDPs. Our PC-POMDP method extends recent PC objectives for supervising topic models and mixture models [Hughes et al., 2017, 2018] to reinforcement learning.

It remains to specify how we quantify the quality of the generative model and the quality of the policy. We optimize the likelihood term on sequences $\mathcal{D}_{\mathrm{expl}}$ collected under an exploration policy. This set covers many possible state-action histories and thus allows better estimation of all entries of the transition and emission parameters τ, μ, σ ∈ θ. For the value of the policy, we cannot simply use the estimated value from the POMDP solver, as a misspecified set of parameters θ could hallucinate an arbitrarily high reward. One choice would be Monte Carlo roll-outs of the policy; to reuse rollouts, we instead turn to off-policy estimation. Specifically, we collect rollouts under a reference behavior policy $\pi_{\mathrm{beh}}$ (known in advance). We then use consistent weighted per-decision importance sampling (CWPDIS) [Thomas, 2015] to reweight observed rewards from data collected under $\pi_{\mathrm{beh}}$, yielding a consistent estimate of the long-term value of our model policy π(θ) under discount factor γ ∈ (0, 1). Crucially, this estimator is a differentiable function of the model parameters θ, and thus Eq. (1) can be optimized via first-order gradient ascent.
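The following sketch illustrates one way the CWPDIS estimate in the second term of Eq. (1) could be computed from logged trajectories. In practice, the per-step probabilities that the model policy π(θ) assigns to the logged actions would come from belief tracking under θ combined with the PBVI policy; here they are simply passed in as an array, and the function name and array layout are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def cwpdis_value(rewards, pi_eval, pi_beh, gamma):
    """Consistent weighted per-decision importance sampling (CWPDIS).

    rewards : (N, T) rewards r_{n,t} from N trajectories logged under pi_beh
    pi_eval : (N, T) probability the evaluation policy pi(theta) assigns to
              the logged action a_{n,t} given the history so far
    pi_beh  : (N, T) probability the behavior policy assigned to a_{n,t}
    gamma   : discount factor in (0, 1)

    Returns a consistent estimate of the discounted value of pi(theta).
    """
    N, T = rewards.shape
    # Cumulative importance ratios rho_{n,1:t} = prod_{s<=t} pi_eval / pi_beh.
    ratios = np.cumprod(pi_eval / pi_beh, axis=1)          # (N, T)
    discounts = gamma ** np.arange(T)                      # (T,)
    # Per-decision weighting: at each step t, rewards are weighted by the
    # cumulative ratios and normalized by the sum of ratios at that step.
    weighted_rewards = (ratios * rewards).sum(axis=0)      # (T,)
    normalizers = ratios.sum(axis=0) + 1e-12               # avoid divide-by-zero
    return float((discounts * weighted_rewards / normalizers).sum())
```

Every operation above is differentiable in the evaluation-policy probabilities, so the same computation written in an automatic-differentiation framework yields gradients of the value term with respect to θ, as required for first-order ascent on Eq. (1).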
3 Synthetic Experiment: Noisy Tiger Problem

We evaluate our PC approach on a challenging extension of the classic POMDP tiger problem of Kaelbling et al. [1998]. A room has K doors; only one door is safe, while the remaining K − 1 have tigers behind them. The agent has A = K + 1 actions: open one of the doors, or listen for noisy evidence of which door is safe to open. Revealing a tiger gives −5 reward, the safe door yields +1 reward, and listening incurs −0.1 reward. Observations $o_{nt}$ have D ≥ 1 dimensions. Only the first dimension signals the safe door, via its mean $i_{\mathrm{safe}} \in \{1, \ldots, K\}$: $o_{nt1} \sim \mathcal{N}(i_{\mathrm{safe}}, \sigma)$ with σ = 0.2. The remaining dimensions are irrelevant, each with a random mean $i \sim \mathrm{Unif}(\{1, \ldots, K\})$ and a narrow Gaussian standard deviation of 0.1 (less than σ = 0.2); a minimal data-generation sketch for this environment is given below. This environment is designed to confuse the two-stage method, which fits θ via an expectation-maximization (EM) procedure that maximizes likelihood only: the first stage will prefer to explain the irrelevant but low-noise dimensions rather than the relevant but higher-noise first dimension. Note that more than K states would be needed to perfectly model the data, but only K states are needed to learn an optimal PBVI policy π(θ). Our proposed PC-POMDP approach with only K states will, given a large enough emphasis on the reward term, favor parameters that focus on the signal dimension and reap better rewards.

Outlook. We anticipate that PC training for POMDPs will have advantages when models are misspecified: setting λ ≫ 0 encourages rewards to guide the parameters while still learning a good HMM. We plan future applications in the medical domain, where our approach is well suited to the combination of noisy, possibly irrelevant observations and a batch data setting. Our joint-training paradigm also allows us to learn from semi-supervised data, where some sequences are missing rewards.
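To make the experimental setup of Section 3 concrete, here is a minimal data-generation sketch for the noisy tiger observations (K doors, D-dimensional observations, only dimension 1 informative). The function name and the choice to redraw the distractor means at every listen are assumptions of this sketch; the text above does not specify whether those means are fixed per episode or resampled each step.

```python
import numpy as np

def sample_listen_observation(i_safe, K, D, rng, sigma_signal=0.2, sigma_noise=0.1):
    """Sample one D-dimensional observation after a 'listen' action.

    Dimension 0 is centered on the index of the safe door (i_safe in
    {1, ..., K}) with std sigma_signal; the remaining D-1 distractor
    dimensions are centered on uniformly random door indices with the
    narrower std sigma_noise, so they look 'cleaner' but carry no signal.
    """
    obs = np.empty(D)
    obs[0] = rng.normal(loc=i_safe, scale=sigma_signal)
    if D > 1:
        distractor_means = rng.integers(1, K + 1, size=D - 1)
        obs[1:] = rng.normal(loc=distractor_means, scale=sigma_noise)
    return obs

# Example: a few listens in a K = 2, D = 8 tiger problem.
rng = np.random.default_rng(0)
i_safe = rng.integers(1, 3)              # safe door, hidden from the agent
listens = np.stack([sample_listen_observation(i_safe, K=2, D=8, rng=rng)
                    for _ in range(5)])
```

With this generator, the distractor dimensions have lower variance than the signal dimension, which is exactly what tempts a likelihood-only first stage to spend its limited states explaining them.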

References

[1] Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs. Operations Research, 1978.
[2] George E. Monahan. State of the Art—A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms. Management Science, 1982.
[3] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 1989.
[4] Lonnie Chrisman. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. In AAAI, 1992.
[5] Yoshua Bengio and Paolo Frasconi. An Input Output HMM Architecture. In NIPS, 1994.
[6] Sven Koenig and Reid Simmons. Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models. 1998.
[7] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 1998.
[8] Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Point-Based Value Iteration: An Anytime Algorithm for POMDPs. In IJCAI, 2003.
[9] Jesse Hoey and Pascal Poupart. Solving POMDPs with Continuous or Large Discrete Observation Spaces. In IJCAI, 2005.
[10] Guy Shani, Joelle Pineau, and Robert Kaplow. A Survey of Point-Based POMDP Solvers. Autonomous Agents and Multi-Agent Systems, 2013.
[11] Philip S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2015.
[12] Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. In AAAI Fall Symposium Series, 2015.
[13] Michael C. Hughes, Leah Weiner, Gabriel Hope, Thomas H. McCoy, Roy H. Perlis, Erik B. Sudderth, and Finale Doshi-Velez. Prediction-Constrained Training for Semi-Supervised Mixture and Topic Models. arXiv preprint, 2017.
[14] Michael C. Hughes, Gabriel Hope, Leah Weiner, Thomas H. McCoy, Roy H. Perlis, Erik B. Sudderth, and Finale Doshi-Velez. Semi-Supervised Prediction-Constrained Topic Models. In AISTATS, 2018.