Inverse Rational Control with Partially Observable Continuous Nonlinear Dynamics

A fundamental question in neuroscience is how the brain creates an internal model of the world to guide actions using sequences of ambiguous sensory information. This is naturally formulated as a reinforcement learning problem under partial observations, where an agent must estimate relevant latent variables in the world from its evidence, anticipate possible future states, and choose actions that optimize total expected reward. This problem can be solved by control theory, which allows us to find the optimal actions for given system dynamics and objective function. However, animals often appear to behave suboptimally. Why? We hypothesize that animals have their own flawed internal model of the world, and choose actions with the highest expected subjective reward according to that flawed model. We describe this behavior as rational but not optimal. The problem of Inverse Rational Control (IRC) aims to identify which internal model would best explain an agent's actions. Our contribution here generalizes past work on Inverse Rational Control, which solved this problem for discrete control in partially observable Markov decision processes. Here we accommodate continuous nonlinear dynamics and continuous actions, and impute sensory observations corrupted by unknown noise that is private to the animal. We first build an optimal Bayesian agent that learns an optimal policy generalized over the entire model space of dynamics and subjective rewards using deep reinforcement learning. Crucially, this allows us to compute a likelihood over models for experimentally observable action trajectories acquired from a suboptimal agent. We then find the model parameters that maximize the likelihood using gradient ascent.
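
The recipe described in the abstract can be summarized in a brief sketch. The following is an illustrative outline only, not the authors' implementation: ParamConditionedPolicy, belief_filter, traj_log_lik, the interpretation of theta (theta[0] as a filter gain, theta[1] as an observation-noise scale), and the placeholder stimulus/action arrays are all hypothetical stand-ins, and the policy is assumed to have already been trained with deep reinforcement learning jointly over the space of model parameters.

```python
# Illustrative sketch of the IRC pipeline -- NOT the authors' code.
# Hypothetical pieces: the policy class, the toy belief filter, the theta
# parameterization, and the placeholder data arrays.
import math
import torch
import torch.nn.functional as F
from torch import nn
from torch.distributions import Normal

BELIEF_DIM, THETA_DIM, ACTION_DIM = 4, 3, 2

class ParamConditionedPolicy(nn.Module):
    """pi(a | belief, theta): one network shared across candidate internal models."""
    def __init__(self, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(BELIEF_DIM + THETA_DIM, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, ACTION_DIM)
        self.log_std = nn.Parameter(torch.zeros(ACTION_DIM))

    def forward(self, belief, theta):
        h = self.body(torch.cat([belief, theta], dim=-1))
        return Normal(self.mean(h), self.log_std.exp())

def belief_filter(obs, theta):
    """Stand-in for a differentiable recursive estimator under the candidate
    dynamics (e.g. an unscented filter); here just an exponential moving
    average whose gain is set by theta[0]."""
    gain = torch.sigmoid(theta[0])
    beliefs, b = [], torch.zeros(BELIEF_DIM)
    for o in obs:
        b = (1 - gain) * b + gain * o
        beliefs.append(b)
    return torch.stack(beliefs)

def traj_log_lik(policy, stimuli, actions, theta, n_mc=8):
    """Monte Carlo estimate of log p(actions | stimuli, theta): sample the
    animal's private sensory noise, run the belief filter, and score the
    recorded actions under the parameter-conditioned policy."""
    obs_std = F.softplus(theta[1])
    log_probs = []
    for _ in range(n_mc):
        obs = stimuli + obs_std * torch.randn_like(stimuli)   # private noisy observations
        beliefs = belief_filter(obs, theta)
        dist = policy(beliefs, theta.expand(len(beliefs), -1))
        log_probs.append(dist.log_prob(actions).sum())
    # log of the average likelihood across noise samples
    return torch.logsumexp(torch.stack(log_probs), dim=0) - math.log(n_mc)

# Inverse step: gradient ascent on the trajectory likelihood over model parameters.
policy = ParamConditionedPolicy()                 # assumed pre-trained over theta space
policy.requires_grad_(False)                      # only theta is optimized here
stimuli = torch.randn(100, BELIEF_DIM)            # placeholder for recorded task stimuli
actions = torch.randn(100, ACTION_DIM)            # placeholder for recorded actions
theta = torch.zeros(THETA_DIM, requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = -traj_log_lik(policy, stimuli, actions, theta)  # negative log-likelihood
    loss.backward()
    opt.step()
print("inferred model parameters:", theta.detach())
```

The sketch keeps only the structure the abstract describes: a single policy generalized over the model space, a likelihood of the recorded action trajectories that marginalizes the animal's private observation noise, and gradient ascent on that likelihood with respect to the model parameters.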
