Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation

Deep reinforcement learning has shown great potential for improving system performance autonomously by learning from interactions with the environment. However, traditional reinforcement learning approaches are designed for static environments. In many real-world problems, the environment is dynamic, and the performance of reinforcement learning approaches can degrade drastically. A direct cause of this degradation is high-variance, biased estimation of the reward due to distribution shift in dynamic environments. In this paper, we propose two techniques to alleviate the unstable reward estimation problem in dynamic environments: the stratified sampling replay strategy and the approximate regretted reward, which address the problem from the sample aspect and the reward aspect, respectively. Integrating the two techniques with Double DQN, we propose the Robust DQN method. We apply Robust DQN to the tip recommendation system of the Taobao online retail platform. We first show the highly dynamic nature of this recommendation application, and then carry out an online A/B test to evaluate Robust DQN. The results show that Robust DQN effectively stabilizes value estimation and therefore improves performance in this real-world dynamic environment.
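
The abstract does not spell out how the stratified sampling replay strategy is implemented; as a rough illustration only, the sketch below (in Python, with hypothetical names such as StratifiedReplayBuffer and stratum_key that do not come from the paper) shows one plausible way a replay buffer could group transitions into strata, for example by the time period in which they were collected, and draw each minibatch from every stratum so that no single regime of the dynamic environment dominates the update.

```python
import random
from collections import defaultdict, deque

class StratifiedReplayBuffer:
    """Minimal sketch of a stratified replay buffer (assumed design, not the
    paper's implementation): transitions are bucketed by a stratum key, and
    each minibatch draws roughly equally from every non-empty stratum."""

    def __init__(self, capacity_per_stratum=10000):
        # One bounded FIFO buffer per stratum.
        self.strata = defaultdict(lambda: deque(maxlen=capacity_per_stratum))

    def add(self, stratum_key, transition):
        # transition = (state, action, reward, next_state, done)
        self.strata[stratum_key].append(transition)

    def sample(self, batch_size):
        keys = [k for k, buf in self.strata.items() if len(buf) > 0]
        if not keys:
            return []
        per_stratum = max(1, batch_size // len(keys))
        batch = []
        for k in keys:
            buf = list(self.strata[k])
            batch.extend(random.sample(buf, min(per_stratum, len(buf))))
        return batch
```

Under this assumed design, the stratum key could be, for instance, the hour of day of the user interaction, so a Double DQN update sees samples from every traffic regime and the variance of its reward estimate under distribution shift is reduced.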
