QTAccel: A Generic FPGA based Design for Q-Table based Reinforcement Learning Accelerators

Q-Table based Reinforcement Learning (QRL) is a class of widely used AI algorithms that work by successively improving estimates of Q-values, the quality of state-action pairs, stored in a table. QRL significantly outperforms Neural Network based techniques when the state space is tractable. Fast learning for AI applications in several domains (such as robotics) with tractable, 'mid-sized' Q-Tables still necessitates performing a large number of rapid updates. State-of-the-art FPGA implementations of QRL do not scale well with increasing Q-Table state space and are therefore inefficient for such applications. In this work, we develop a novel FPGA-based design for QRL and SARSA (State-Action-Reward-State-Action) that scales to large state spaces, thereby facilitating a large class of AI applications. Our architecture provides higher throughput while using significantly fewer on-chip resources. It supports a variety of action-selection policies, covering Q-Learning and variations of bandit algorithms, and can be easily extended to multi-agent Q-Learning. Our pipelined implementation fully handles the dependencies between consecutive updates, allowing it to process one sample every clock cycle. We evaluate our architecture for the Q-Learning and SARSA algorithms and show that our designs achieve a high throughput of up to 180 million samples per second.
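To make the underlying computation concrete, the following minimal Python sketch shows the tabular Q-Learning and SARSA update rules, plus an epsilon-greedy action-selection policy as one example of the policies such an accelerator could support. The function names, hyperparameter values, and interface are illustrative assumptions for exposition only; they are not taken from the paper or its hardware design.

```python
import numpy as np

# Illustrative hyperparameters (assumed values, not from the paper).
ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor

def q_learning_update(q_table: np.ndarray, s: int, a: int, r: float, s_next: int) -> None:
    """Off-policy Q-Learning update: bootstrap on the greedy action in s_next."""
    target = r + GAMMA * np.max(q_table[s_next])
    q_table[s, a] += ALPHA * (target - q_table[s, a])

def sarsa_update(q_table: np.ndarray, s: int, a: int, r: float,
                 s_next: int, a_next: int) -> None:
    """On-policy SARSA update: bootstrap on the action actually taken in s_next."""
    target = r + GAMMA * q_table[s_next, a_next]
    q_table[s, a] += ALPHA * (target - q_table[s, a])

def epsilon_greedy(q_table: np.ndarray, s: int, epsilon: float,
                   rng: np.random.Generator) -> int:
    """Example action-selection policy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[s]))
```

Each update reads one Q-table entry, computes a bootstrapped target, and writes the entry back; when consecutive samples touch the same state-action pair, the write of one update feeds the read of the next, which is the dependency the pipelined design must resolve to sustain one sample per clock cycle.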
