Reinforcement Learning in Neural Networks: A Survey

In recent years, research on reinforcement learning (RL) has focused on bridging the gap between adaptive optimal control and bio-inspired learning techniques. Neural network reinforcement learning (NNRL) is among the most popular families of algorithms in the RL framework. Using neural networks enables RL to search for optimal policies more efficiently in several real-life applications. Although many surveys have investigated general RL, none is specifically dedicated to the combination of artificial neural networks and RL. This paper therefore describes the state of the art of NNRL algorithms, with a focus on robotics applications. The survey begins with a discussion of the concepts of RL, followed by a review of several NNRL algorithms. The performance of the different NNRL algorithms is then evaluated and compared empirically on learning prediction and learning control tasks. The paper concludes with a discussion of open issues.

[1]  Jean-Pascal Pfister,et al.  Sequence learning with hidden units in spiking neural networks , 2011, NIPS.

[2]  Chi-Sing Leung Optimum learning for bidirectional associative memory in the sense of capacity , 1994 .

[3]  Shalabh Bhatnagar,et al.  An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes , 2010, Syst. Control. Lett..

[4]  Jennie Si,et al.  Handbook of Learning and Approximate Dynamic Programming (IEEE Press Series on Computational Intelligence) , 2004 .

[5]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[6]  Mu-Chun Su,et al.  Neural-network-based fuzzy model and its application to transient stability prediction in power systems , 1999, IEEE Trans. Syst. Man Cybern. Part C.

[7]  Gábor Balázs,et al.  Cascade-Correlation Neural Networks : A Survey , 2010 .

[8]  Xin Zhang,et al.  Data-Driven Robust Approximate Optimal Tracking Control for Unknown General Nonlinear Systems Using Adaptive Dynamic Programming Method , 2011, IEEE Transactions on Neural Networks.

[9]  B. Bakker,et al.  Reinforcement learning by backpropagation through an LSTM model/critic , 2007, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[10]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[11]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[12]  Pierre Geurts,et al.  Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[13]  Paul J. Werbos,et al.  2009 Special Issue: Intelligence in the brain: A theory of how it works and how to build it , 2009 .

[14]  Marc Toussaint,et al.  Learning model-free robot control by a Monte Carlo EM algorithm , 2009, Auton. Robots.

[15]  Bart De Schutter,et al.  Reinforcement Learning and Dynamic Programming Using Function Approximators , 2010 .

[16]  Jyh-Shing Roger Jang,et al.  ANFIS: adaptive-network-based fuzzy inference system , 1993, IEEE Trans. Syst. Man Cybern..

[17]  Samir Kouro,et al.  Unidimensional Modulation Technique for Cascaded Multilevel Converters , 2009, IEEE Transactions on Industrial Electronics.

[18]  Warren B. Powell,et al.  Reinforcement Learning and Its Relationship to Supervised Learning , 2004 .

[19]  Raúl Rojas,et al.  Neural Networks - A Systematic Introduction , 1996 .

[20]  Shuzhi Sam Ge,et al.  Robust adaptive control of uncertain force/motion constrained nonholonomic mobile manipulators , 2008, Autom..

[21]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[22]  Sander M. Bohte,et al.  Error-backpropagation in temporally encoded networks of spiking neurons , 2000, Neurocomputing.

[23]  Christian Lebiere,et al.  The Cascade-Correlation Learning Architecture , 1989, NIPS.

[24]  C. Christodoulou,et al.  Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting. , 2010, The Chinese journal of physiology.

[25]  Mark Ring Two methods for hierarchy learning in reinforcement environments , 1993 .

[26]  Jürgen Schmidhuber,et al.  Training Recurrent Networks by Evolino , 2007, Neural Computation.

[27]  Madan Gopal,et al.  A REINFORCEMENT LEARNING ALGORITHM WITH EVOLVING FUZZY NEURAL NETWORKS , 2014 .

[28]  André da Motta Salles Barreto,et al.  Reinforcement Learning using Kernel-Based Stochastic Factorization , 2011, NIPS.

[29]  Ruya Samli STOCHASTIC NEURAL NETWORKS AND THEIR SOLUTIONS TO OPTIMISATION PROBLEMS , 2012 .

[30]  Zeng-ou Wang A Bidirectional Associative Memory Based on Optimal Linear Associative Memory , 1996, IEEE Trans. Computers.

[31]  Shaocheng Tong,et al.  A DSC Approach to Robust Adaptive NN Tracking Control for Strict-Feedback Nonlinear Systems , 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[32]  Hamid R. Berenji,et al.  A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters , 2003, IEEE Trans. Fuzzy Syst..

[33]  V. Borkar Stochastic approximation with two time scales , 1997 .

[34]  Shuzhi Sam Ge,et al.  Adaptive Robust Output-Feedback Motion/Force Control of Electrically Driven Nonholonomic Mobile Manipulators , 2007, IEEE Transactions on Control Systems Technology.

[35]  Shuzhi Sam Ge,et al.  Adaptive tracking control of uncertain MIMO nonlinear systems with input constraints , 2011, Autom..

[36]  Geoffrey J. Gordon Stable Function Approximation in Dynamic Programming , 1995, ICML.

[37]  Mu-Chun Su Identification of singleton fuzzy models via fuzzy hyperrectangular composite NN , 1997 .

[38]  Andres El-Fakdi,et al.  Semi-online neural-Q/spl I.bar/leaming for real-time robot learning , 2003, Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453).

[39]  Frank L. Lewis,et al.  Online actor critic algorithm to solve the continuous-time infinite horizon optimal control problem , 2009, 2009 International Joint Conference on Neural Networks.

[40]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[41]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[42]  Gerald Tesauro,et al.  Temporal Difference Learning and TD-Gammon , 1995, J. Int. Comput. Games Assoc..

[43]  Gianluca Baldassarre,et al.  A modular neural-network model of the basal ganglia’s role in learning and selecting motor behaviours , 2002, Cognitive Systems Research.

[44]  P. Lanzi,et al.  Adaptive Agents with Reinforcement Learning and Internal Memory , 2000 .

[45]  Stuart J. Russell,et al.  Reinforcement Learning with Hierarchies of Machines , 1997, NIPS.

[46]  Jose B. Cruz,et al.  Two coding strategies for bidirectional associative memory , 1990, IEEE Trans. Neural Networks.

[47]  Zidong Wang,et al.  Exponential stability of delayed recurrent neural networks with Markovian jumping parameters , 2006 .

[48]  Loredana Zollo,et al.  Hierarchical reinforcement learning and central pattern generators for modeling the development of rhythmic manipulation skills , 2011, 2011 IEEE International Conference on Development and Learning (ICDL).

[49]  Wulfram Gerstner,et al.  Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons , 2013, PLoS Comput. Biol..

[50]  Steven J. Bradtke,et al.  Linear Least-Squares algorithms for temporal difference learning , 2004, Machine Learning.

[51]  Li Tang,et al.  Adaptive neural network control of robot manipulator using reinforcement learning , 2014 .

[52]  Jennie Si,et al.  Helicopter trimming and tracking control using direct neural dynamic programming , 2003, IEEE Trans. Neural Networks.

[53]  Joy Bose,et al.  An associative memory for the on-line recognition and prediction of temporal sequences , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[54]  Sean P. Meyn,et al.  An analysis of reinforcement learning with function approximation , 2008, ICML '08.

[55]  Leslie Pack Kaelbling,et al.  Practical Reinforcement Learning in Continuous Spaces , 2000, ICML.

[56]  Shalabh Bhatnagar,et al.  Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation , 2009, NIPS.

[57]  Li Li,et al.  Neuro-Fuzzy Dynamic-Inversion-Based Adaptive Control for Robotic Manipulators—Discrete Time Case , 2007, IEEE Transactions on Industrial Electronics.

[58]  Fuchun Sun,et al.  Stable neural-network-based adaptive control for sampled-data nonlinear systems , 1998, IEEE Trans. Neural Networks.

[59]  Shigeo Abe,et al.  A reinforcement learning algorithm for neural networks with incremental learning ability , 2002, Proceedings of the 9th International Conference on Neural Information Processing, 2002. ICONIP '02..

[60]  Andrew G. Barto,et al.  Reinforcement learning , 1998 .

[61]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[62]  Anne Nagel Neural Networks And Fuzzy Systems A Dynamical Systems Approach To Machine Intelligence , 2016 .

[63]  Jerry M. Mendel,et al.  Back-propagation fuzzy system as nonlinear dynamic system identifiers , 1992, [1992 Proceedings] IEEE International Conference on Fuzzy Systems.

[64]  Xinghui Zhang,et al.  Sensitivity to noise in bidirectional associative memory (BAM) , 2005, IEEE Transactions on Neural Networks.

[65]  Kurt Binder,et al.  Monte Carlo Simulation in Statistical Physics , 1992, Graduate Texts in Physics.

[66]  Jose B. Cruz,et al.  Encoding strategy for maximum noise tolerance bidirectional associative memory , 2005, IEEE Transactions on Neural Networks.

[67]  Halbert White,et al.  Learning in Artificial Neural Networks: A Statistical Perspective , 1989, Neural Computation.

[68]  Yasuo Kuniyoshi,et al.  Robust central pattern generators for embodied hierarchical reinforcement learning , 2011, 2011 IEEE International Conference on Development and Learning (ICDL).

[69]  M. Georgiopoulos,et al.  Feed-forward neural networks , 1994, IEEE Potentials.

[70]  Warren E. Dixon,et al.  Asymptotic tracking by a reinforcement learning-based adaptive critic controller , 2011 .

[71]  Jürgen Schmidhuber,et al.  Optimal Ordered Problem Solver , 2002, Machine Learning.

[72]  Michael Aichinger,et al.  Monte Carlo Simulation , 2013 .

[73]  Mahmood Amiri,et al.  BAM Learning of Nonlinearly Separable Tasks by Using an Asymmetrical Output Function and Reinforcement Learning , 2009, IEEE Transactions on Neural Networks.

[74]  Xue Jinlin,et al.  Neurofuzzy velocity tracking control with reinforcement learning , 2009, 2009 9th International Conference on Electronic Measurement & Instruments.

[75]  Z. Ibrahim,et al.  Mobile phone customers churn prediction using elman and Jordan Recurrent Neural Network , 2012, 2012 7th International Conference on Computing and Convergence Technology (ICCCT).

[76]  Justin A. Boyan,et al.  Technical Update: Least-Squares Temporal Difference Learning , 2002, Machine Learning.

[77]  R.J. Williams,et al.  Reinforcement learning is direct adaptive optimal control , 1991, IEEE Control Systems.

[78]  Andrzej J. Kasinski,et al.  Supervised Learning in Spiking Neural Networks with ReSuMe: Sequence Learning, Classification, and Spike Shifting , 2010, Neural Computation.

[79]  Dan Simon,et al.  Computational Modeling and Simulation of Intellect: Current State and Future Perspectives , 2011 .

[80]  Bart Kosko,et al.  Neural networks and fuzzy systems: a dynamical systems approach to machine intelligence , 1991 .

[81]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[82]  Alexandros Giagkos,et al.  From Animals to Animats 14 , 2016, Lecture Notes in Computer Science.

[83]  Ila R Fiete,et al.  Gradient learning in spiking neural networks by dynamic perturbation of conductances. , 2006, Physical review letters.

[84]  Sungchul Kang,et al.  Impedance Learning for Robotic Contact Tasks Using Natural Actor-Critic Algorithm , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[85]  G. Rizzolatti,et al.  The Organization of the Frontal Motor Cortex. , 2000, News in physiological sciences : an international journal of physiology produced jointly by the International Union of Physiological Sciences and the American Physiological Society.

[86]  Sridhar Mahadevan,et al.  Recent Advances in Hierarchical Reinforcement Learning , 2003, Discret. Event Dyn. Syst..

[87]  Domenico Parisi,et al.  A Bioinspired Hierarchical Reinforcement Learning Architecture for Modeling Learning of Multiple Skills with Continuous States and Actions , 2010, EpiRob.

[88]  Alin Albu-Schäffer,et al.  Human-Like Adaptation of Force and Impedance in Stable and Unstable Interactions , 2011, IEEE Transactions on Robotics.

[89]  Frank L. Lewis,et al.  Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem , 2010, Autom..

[90]  Markus Diesmann,et al.  A Spiking Neural Network Model of an Actor-Critic Learning Agent , 2009, Neural Computation.

[91]  Razvan V. Florian,et al.  Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity , 2007, Neural Computation.

[92]  Razvan V. Florian A reinforcement learning algorithm for spiking neural networks , 2005, Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC'05).

[93]  Jagannathan Sarangapani,et al.  Neural Network Control of Nonlinear Discrete-Time Systems , 2018 .

[94]  André Grüning,et al.  Elman Backpropagation as Reinforcement for Simple Recurrent Networks , 2007, Neural Computation.

[95]  Chin-Teng Lin,et al.  Neural-Network-Based Fuzzy Logic Control and Decision System , 1991, IEEE Trans. Computers.

[96]  BART KOSKO,et al.  Bidirectional associative memories , 1988, IEEE Trans. Syst. Man Cybern..

[97]  Mahesan Niranjan,et al.  On-line Q-learning using connectionist systems , 1994 .

[98]  Steven J. Bradtke,et al.  Incremental dynamic programming for on-line adaptive optimal control , 1995 .

[99]  Pawel Wawrzynski,et al.  Learning population of spiking neural networks with perturbation of conductances , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[100]  Shin Ishii,et al.  Reinforcement learning for a biped robot based on a CPG-actor-critic method , 2007, Neural Networks.

[101]  Mark B. Ring Learning Sequential Tasks by Incrementally Adding Higher Orders , 1992, NIPS.

[102]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[103]  Zoran Miljkovic,et al.  Neural network Reinforcement Learning for visual control of robot manipulators , 2013, Expert Syst. Appl..

[104]  Csaba Szepesvári,et al.  Algorithms for Reinforcement Learning , 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[105]  Andrew McCallum,et al.  Learning to Use Selective Attention and Short-Term Memory in Sequential Tasks , 1996 .

[106]  John K. Williams,et al.  Reinforcement Learning of Optimal Controls , 2009 .

[107]  Ann Maria Bell,et al.  Reinforcement Learning Rules in a Repeated Game , 2001 .

[108]  F.L. Lewis,et al.  Reinforcement learning and adaptive dynamic programming for feedback control , 2009, IEEE Circuits and Systems Magazine.

[109]  Mounir Boukadoum,et al.  A bidirectional heteroassociative memory for binary and grey-level patterns , 2006, IEEE Transactions on Neural Networks.

[110]  Lyle Noakes,et al.  Continuous-Time Adaptive Critics , 2007, IEEE Transactions on Neural Networks.

[111]  Frank L. Lewis,et al.  Adaptive dynamic programming applied to a 6DoF quadrotor , 2011 .

[112]  Igor Farkas,et al.  Grounding the Meanings in Sensorimotor Behavior using Reinforcement Learning , 2012, Front. Neurorobot..

[113]  Maja J. Matarić,et al.  Learning to Use Selective Attention and Short-Term Memory in Sequential Tasks , 1996 .

[114]  Thomas G. Dietterich Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition , 1999, J. Artif. Intell. Res..

[115]  Farid U. Dowla,et al.  Backpropagation Learning for Multilayer Feed-Forward Neural Networks Using the Conjugate Gradient Method , 1991, Int. J. Neural Syst..

[116]  V. Borkar Stochastic Approximation: A Dynamical Systems Viewpoint , 2008 .