Two-stage recommender systems are widely adopted in industry due to their scalability and maintainability. These systems produce recommendations in two steps: (i) multiple nominators preselect a small number of items from a large pool using cheap-to-compute item embeddings; (ii) with a richer set of features, a ranker rearranges the nominated items and serves them to the user. A key challenge of this setup is that optimal performance of each stage in isolation does not imply optimal global performance. In response to this issue, Ma et al. (2020) proposed a nominator training objective importance-weighted by the ranker's probability of recommending each item. In this work, we focus on the complementary issue of exploration. Modeling the problem as a contextual bandit, we find that LinUCB (a near-optimal exploration strategy for single-stage systems) may lead to linear regret when deployed in two-stage recommenders. We therefore propose a method of synchronising the exploration strategies between the ranker and the nominators. Our algorithm relies only on quantities already computed by standard LinUCB at each stage and can be implemented in three lines of additional code. We end by demonstrating the effectiveness of our algorithm experimentally.
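The abstract does not spell out the synchronisation step itself, but the quantities it refers to are those of standard LinUCB [9, 10]: a ridge-regression estimate of the reward weights plus a confidence-width exploration bonus. Below is a minimal sketch of those per-stage quantities and of a toy two-stage flow; all class and variable names, the feature dimensions, and the shortlist size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class LinUCB:
    """Standard single-stage LinUCB (Li et al., 2010): ridge estimate
    of the reward weights plus an upper-confidence exploration bonus."""

    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha          # exploration strength
        self.A = reg * np.eye(dim)  # regularised Gram matrix
        self.b = np.zeros(dim)      # reward-weighted feature sum

    def ucb(self, x):
        """Upper confidence bound for item features x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # ridge estimate
        width = self.alpha * np.sqrt(x @ A_inv @ x)   # confidence width
        return theta @ x + width

    def update(self, x, reward):
        """Incorporate an observed (features, reward) pair."""
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy two-stage flow (hypothetical): the nominator shortlists items by
# its own UCB score; the ranker then re-scores the shortlist. For
# simplicity both stages use the same features here, whereas in a real
# system the ranker would use a richer feature set.
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 8))   # cheap nominator embeddings
nominator, ranker = LinUCB(dim=8), LinUCB(dim=8)
shortlist = np.argsort([-nominator.ucb(x) for x in pool])[:10]
chosen = max(shortlist, key=lambda i: ranker.ucb(pool[i]))
```

Note that each stage maintains its own `A` and `b`, so each computes its own confidence widths; the paper's contribution is to synchronise these exploration terms across stages rather than letting each stage explore independently.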
[1] Varsha Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008.
[2] Jiaqi Ma et al. Off-policy Learning in Two-stage Recommender Systems. WWW, 2020.
[3] Naoki Abe and Philip M. Long. Associative Reinforcement Learning using Linear Probabilistic Concepts. ICML, 1999.
[4] Chantat Eksombatchai et al. Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time. WWW, 2018.
[5] Minmin Chen et al. Top-K Off-Policy Correction for a REINFORCE Recommender System. WSDM, 2019.
[6] Fedor Borisyuk et al. CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents. KDD, 2016.
[7] Xinyang Yi et al. Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations. RecSys, 2019.
[8] Paul Covington et al. Deep Neural Networks for YouTube Recommendations. RecSys, 2016.
[9] Lihong Li et al. A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW, 2010.
[10] Peter Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 2002.
[11] Nathaniel D. Daw et al. Cortical substrates for exploratory decisions in humans. Nature, 2006.