Towards Off-Policy Learning for Ranking Policies with Logged Feedback

Probabilistic learning to rank (LTR) has been the dominant approach for optimizing ranking metrics, but it cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize users' long-term rewards by formulating recommendation as a sequential decision-making problem, but they typically achieve inferior ranking accuracy compared with their LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that simultaneously maximizes users' long-term rewards and optimizes the ranking metric offline, improving sample efficiency within a unified Expectation-Maximization (EM) framework. We theoretically and empirically show that the EM process guides the learned policy to benefit from integrating the future reward and the ranking metric, while learning without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.
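To make the EM view concrete, the following PyTorch sketch shows one possible shape of such an off-policy training loop: a critic is fitted to logged transitions with a soft Bellman target, the E-step forms a Boltzmann "posterior" over items from the Q-values, and the M-step fits the ranking policy to the logged feedback while pulling it toward that posterior. All specifics here (linear networks, the softmax ranking loss, the temperature, the loss weighting) are assumptions for illustration only and are not taken from the abstract above.

    # Minimal EM-style off-policy sketch (assumed details, not the paper's exact method).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_ITEMS, STATE_DIM, GAMMA, TAU = 1000, 64, 0.9, 1.0  # hypothetical sizes/constants

    class Agent(nn.Module):
        def __init__(self):
            super().__init__()
            self.q_net = nn.Linear(STATE_DIM, NUM_ITEMS)    # value model (critic)
            self.policy = nn.Linear(STATE_DIM, NUM_ITEMS)   # ranking policy (actor)

        def forward(self, state):
            return self.q_net(state), self.policy(state)

    def em_step(agent, optimizer, batch):
        state, action, reward, next_state = batch
        q_logits, policy_logits = agent(state)

        # Critic update: one-step soft Bellman target computed from logged
        # transitions only (off-policy, no live user interaction).
        with torch.no_grad():
            next_q, _ = agent(next_state)
            target = reward + GAMMA * TAU * torch.logsumexp(next_q / TAU, dim=-1)
        q_sa = q_logits.gather(1, action.unsqueeze(1)).squeeze(1)
        critic_loss = F.mse_loss(q_sa, target)

        # E-step: Boltzmann distribution over items induced by the Q-values.
        with torch.no_grad():
            posterior = F.softmax(q_logits / TAU, dim=-1)

        # M-step: softmax ranking loss on the logged items, plus a KL term
        # pulling the policy toward the value-induced posterior.
        log_pi = F.log_softmax(policy_logits, dim=-1)
        ranking_loss = F.nll_loss(log_pi, action)
        kl_to_posterior = F.kl_div(log_pi, posterior, reduction="batchmean")

        loss = critic_loss + ranking_loss + kl_to_posterior
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Usage sketch with random logged transitions:
    # agent = Agent(); opt = torch.optim.Adam(agent.parameters(), lr=1e-3)
    # batch = (torch.randn(32, STATE_DIM), torch.randint(0, NUM_ITEMS, (32,)),
    #          torch.rand(32), torch.randn(32, STATE_DIM))
    # em_step(agent, opt, batch)

The intent of the sketch is only to show how a ranking loss on logged feedback and a value-based posterior can be optimized jointly in alternating E- and M-steps; the paper's actual objective, networks, and regularization may differ.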
