QTAccel: A Generic FPGA based Design for Q-Table based Reinforcement Learning Accelerators

Q-Table based Reinforcement Learning (QRL) is a class of widely used AI algorithms that work by successively improving estimates of Q-values, the quality of state-action pairs, stored in a table. QRL significantly outperforms Neural Network based techniques when the state space is tractable. Fast learning for AI applications in several domains (such as robotics) with tractable, 'mid-sized' Q-Tables still necessitates performing a large number of rapid updates. State-of-the-art FPGA implementations of QRL do not scale well with increasing Q-Table state space and are therefore inefficient for such applications. In this work, we develop a novel FPGA-based design for QRL and SARSA (State-Action-Reward-State-Action) that scales to large state spaces, thereby facilitating a large class of AI applications. Our architecture provides higher throughput while using significantly fewer on-chip resources. It supports a variety of action-selection policies, covering Q-Learning and variations of bandit algorithms, and can be easily extended to multi-agent Q-Learning. Our pipelined implementation fully handles the dependencies between consecutive updates, allowing it to process one sample every clock cycle. We evaluate our architecture for the Q-Learning and SARSA algorithms and show that our designs achieve a high throughput of up to 180 million samples per second.
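To make the underlying computation concrete, the following minimal Python sketch shows the tabular Q-Learning and SARSA update rules, plus an epsilon-greedy action-selection policy as one example of the policies such an accelerator could support. The function names, hyperparameter values, and interface are illustrative assumptions for exposition only; they are not taken from the paper or its hardware design.

```python
import numpy as np

# Illustrative hyperparameters (assumed values, not from the paper).
ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor

def q_learning_update(q_table: np.ndarray, s: int, a: int, r: float, s_next: int) -> None:
    """Off-policy Q-Learning update: bootstrap on the greedy action in s_next."""
    target = r + GAMMA * np.max(q_table[s_next])
    q_table[s, a] += ALPHA * (target - q_table[s, a])

def sarsa_update(q_table: np.ndarray, s: int, a: int, r: float,
                 s_next: int, a_next: int) -> None:
    """On-policy SARSA update: bootstrap on the action actually taken in s_next."""
    target = r + GAMMA * q_table[s_next, a_next]
    q_table[s, a] += ALPHA * (target - q_table[s, a])

def epsilon_greedy(q_table: np.ndarray, s: int, epsilon: float,
                   rng: np.random.Generator) -> int:
    """Example action-selection policy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[s]))
```

Each update reads one Q-table entry, computes a bootstrapped target, and writes the entry back; when consecutive samples touch the same state-action pair, the write of one update feeds the read of the next, which is the dependency the pipelined design must resolve to sustain one sample per clock cycle.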
