A prioritized objective actor-critic method for deep reinforcement learning

An increasing number of complex problems pose significant challenges for decision-making theory and reinforcement learning practice. These problems often involve multiple conflicting reward signals, which inherently cause an agent to explore poorly while pursuing a specific goal. In extreme cases, the agent gets stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which adjust agent behavior to efficiently achieve a designated goal by adopting a weighted-sum scalarization of the different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined set of priority weights over the objectives. SCMP, in contrast, generates a mixed policy from a set of priority weights: the generated policy draws on the knowledge of several policies (each corresponding to one priority weight) to dynamically prioritize objectives in real time. We implement our methods on top of the Asynchronous Advantage Actor-Critic (A3C) algorithm, exploiting its multithreading mechanism to dynamically balance the training intensity of the different policies within a single network. Finally, simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean total reward in two complex problems: Food Collector and Seaquest.
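To make the scalarization concrete, the following sketch (plain Python/NumPy, not the authors' code) illustrates how per-objective advantages could be combined with predefined priority weights into a single scalar for the policy-gradient step, in the spirit of an MCSP-style update with one critic per objective. The function name scalarized_advantage, the one-step TD form of the advantage, and the toy numbers are illustrative assumptions; A3C itself uses n-step returns.

import numpy as np

def scalarized_advantage(rewards, values, next_values, weights, gamma=0.99):
    # rewards, values, next_values: arrays of shape (num_objectives,)
    # weights: predefined priority weights over the objectives (sum to 1)
    # One-step TD advantage per objective (a simplification of A3C's n-step returns).
    per_objective_adv = rewards + gamma * next_values - values
    # Weighted-sum scalarization: a single advantage drives the shared policy update.
    return float(np.dot(weights, per_objective_adv))

# Toy usage with two conflicting objectives (e.g., collect food vs. avoid damage).
weights = np.array([0.7, 0.3])        # predefined priority weights
rewards = np.array([1.0, -0.5])       # per-objective rewards at this step
values = np.array([2.0, -1.0])        # per-objective critic estimates V_k(s)
next_values = np.array([2.5, -0.8])   # per-objective critic estimates V_k(s')
print(scalarized_advantage(rewards, values, next_values, weights))

In an SCMP-style setting, the weight vector would instead be varied at run time so that the mixed policy can re-prioritize the objectives dynamically.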
