Deep Reinforcement Learning With Modulated Hebbian Plus Q-Network Architecture

In this article, we consider a subclass of partially observable Markov decision process (POMDP) problems that we term confounding POMDPs. In these POMDPs, temporal-difference (TD)-based reinforcement learning (RL) algorithms struggle because the TD error cannot be reliably derived from observations. We solve these problems using a new bio-inspired neural architecture that combines a modulated Hebbian network (MOHN) with a deep Q-network (DQN), which we call the modulated Hebbian plus Q-network architecture (MOHQA). The key idea is to use a Hebbian network with rarely correlated, bio-inspired neural traces to bridge the temporal delay between actions and rewards when confounding observations and sparse rewards produce inaccurate TD errors. In MOHQA, the DQN learns low-level features and control, while the MOHN contributes to high-level decisions by associating rewards with past states and actions. The proposed architecture thus combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, and exploits the advantages of both. Simulations on a set of POMDPs and on the Malmo environment show that the proposed algorithm improves on DQN and even outperforms control tests with advantage actor-critic (A2C), quantile-regression DQN with long short-term memory (QRDQN + LSTM), Monte Carlo policy gradient (REINFORCE), and aggregated memory for reinforcement learning (AMRL) on the most difficult POMDPs with confounding stimuli and sparse rewards.
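
To make the architectural idea concrete, the sketch below illustrates, under stated assumptions, how a reward-modulated Hebbian layer with sparse eligibility traces could be combined with the output of a Q head. This is not the authors' implementation: the class name ModulatedHebbianLayer, the trace decay, the rare-correlation threshold, and the additive combination with the Q-values are all hypothetical choices made only for illustration.

```python
# A minimal sketch (not the authors' released code) of the idea described above:
# a reward-modulated Hebbian layer whose sparse eligibility traces bridge the
# delay between an action and a later reward, combined with a separate Q head.
# All names, the trace decay, and the rare-correlation threshold are assumptions.
import numpy as np

class ModulatedHebbianLayer:
    def __init__(self, n_inputs, n_actions, lr=0.01, trace_decay=0.95,
                 rare_threshold=0.9):
        self.W = np.zeros((n_actions, n_inputs))    # Hebbian associative weights
        self.trace = np.zeros_like(self.W)          # eligibility traces
        self.lr = lr
        self.trace_decay = trace_decay
        self.rare_threshold = rare_threshold        # keeps traces sparse ("rare correlations")

    def act_values(self, obs):
        # Contribution of the Hebbian module to the action preferences.
        return self.W @ obs

    def update_trace(self, obs, action):
        # Decay existing traces, then tag only those synapses of the chosen
        # action whose presynaptic activity is unusually strong, so that
        # sparse coincidences leave a memory of the decision.
        self.trace *= self.trace_decay
        rare = obs * (obs > self.rare_threshold)
        self.trace[action] += rare

    def modulate(self, reward):
        # A (possibly delayed) reward acts as a global modulatory signal:
        # weights move in proportion to reward times the surviving traces.
        self.W += self.lr * reward * self.trace

# Illustrative combination with a DQN-style head (a linear stand-in here):
rng = np.random.default_rng(0)
n_obs, n_actions = 16, 4
hebb = ModulatedHebbianLayer(n_obs, n_actions)
q_weights = np.zeros((n_actions, n_obs))            # placeholder for the DQN output layer

obs = rng.random(n_obs)
q_values = q_weights @ obs + hebb.act_values(obs)   # DQN and MOHN outputs are combined
action = int(np.argmax(q_values))
hebb.update_trace(obs, action)
hebb.modulate(reward=1.0)                           # sparse reward may arrive many steps later
```

In this reading, the Hebbian module only needs a scalar reward to update, so it can credit decisions made many steps earlier through its traces, while the Q head continues to learn features and control from the usual TD-style pipeline.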
