Batch-Constrained Distributional Reinforcement Learning for Session-based Recommendation

Most existing deep reinforcement learning (RL) approaches for session-based recommendation rely either on costly online interactions with real users or on potentially biased rule-based or data-driven user-behavior models. In this work, we instead focus on learning recommendation policies in the pure batch (offline) setting, i.e., solely from historical interaction logs generated by an unknown and sub-optimal behavior policy, without further access to data from the real world or to user-behavior models. We propose BCD4Rec: Batch-Constrained Distributional RL for Session-based Recommendations. BCD4Rec builds upon recent advances in batch (offline) RL and distributional RL to learn from offline logs while handling the intrinsically stochastic nature of user rewards, which arises from varied latent interest preferences (environments). We demonstrate that, in the batch setting, BCD4Rec significantly improves upon the behavior policy as well as strong RL and non-RL baselines in terms of standard performance metrics such as click-through rate or buy rate. Other useful properties of BCD4Rec include: (i) recommending items from the correct latent categories, indicating better value estimates despite a large action space (of the order of the number of items), and (ii) overcoming the popularity bias in clicked or bought items that is typically present in offline logs.
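The abstract names two ingredients, a batch constraint and a distributional value estimate, without giving their mechanics. The following is a minimal sketch of one common way to combine them for a discrete item set, assuming quantile-based return estimates and a relative-probability filter on an estimated behavior policy. The function name, the `threshold` parameter, and the toy numbers are hypothetical and chosen for illustration only; this is not the BCD4Rec implementation.

```python
import numpy as np


def batch_constrained_action(quantiles, behavior_probs, threshold=0.3):
    """Choose an item from quantile value estimates, restricted to items
    that the (estimated) behavior policy plausibly logged.

    quantiles:      shape (num_items, num_quantiles), return quantiles per item
    behavior_probs: shape (num_items,), estimated behavior-policy probabilities
    threshold:      hypothetical cutoff, relative to the most likely item
    """
    quantiles = np.asarray(quantiles, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)

    # Expected return per item: mean over equally weighted quantiles.
    q_values = quantiles.mean(axis=1)

    # Batch constraint (assumed form): keep only items whose probability,
    # relative to the most likely item, reaches the threshold.
    allowed = behavior_probs / behavior_probs.max() >= threshold

    # Disallowed items can never win the argmax.
    masked = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked))


# Toy example: item 2 has the highest estimated value but almost no
# support in the logs, so the constrained choice falls back to item 1.
toy_quantiles = [[0.0, 0.2, 0.4],
                 [0.5, 0.6, 0.7],
                 [2.0, 3.0, 4.0]]
toy_behavior = [0.60, 0.39, 0.01]

constrained = batch_constrained_action(toy_quantiles, toy_behavior)
unconstrained = batch_constrained_action(toy_quantiles, toy_behavior, threshold=0.0)
print(constrained, unconstrained)
```

With `threshold=0.0` the filter is inactive and the choice reduces to a plain greedy pick, which illustrates why value estimates for rarely logged items are risky in the batch setting.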
