Paper information - Deriving User- and Content-specific Rewards for Contextual Bandits

Deriving User- and Content-specific Rewards for Contextual Bandits

Bandit algorithms have gained increased attention in recommender systems, as they provide effective and scalable recommendations. These algorithms use reward functions, usually based on a numeric variable such as click-through rate, as the basis for optimization. On a popular music streaming service, a contextual bandit algorithm is used to decide which content to recommend to users, where the reward function is a binarization of a numeric variable that defines success based on a static threshold of user streaming time: 1 if the user streamed for at least 30 seconds and 0 otherwise. We explore alternative methods to provide a more informed reward function, based on the assumption that the streaming time distribution depends heavily on the type of user and the type of content being streamed. To automatically extract user and content groups from streaming data, we employ "co-clustering", an unsupervised learning technique that simultaneously extracts clusters of rows and columns from a co-occurrence matrix. The streaming distributions within the co-clusters are then used to define rewards specific to each co-cluster. Our proposed co-clustering-based reward functions lead to an improvement of over 25% in expected stream rate, compared to the standard binarized rewards.
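To make the contrast between the two reward definitions concrete, here is a minimal Python sketch. The static 30-second rule comes from the abstract. Everything else is an assumption made for illustration: the abstract does not say how the per-co-cluster streaming distributions are turned into rewards, so this sketch uses the median streaming time of each co-cluster as its threshold, and it takes the cluster assignments as given rather than computing them. The function names, cluster labels, and toy events are hypothetical.

```python
from collections import defaultdict
from statistics import median

# From the abstract: success means the user streamed for at least 30 seconds.
STATIC_THRESHOLD_S = 30.0


def static_reward(stream_seconds):
    """Baseline binarized reward: 1 if streamed >= 30 seconds, else 0."""
    return 1 if stream_seconds >= STATIC_THRESHOLD_S else 0


def fit_cocluster_thresholds(events, user_cluster, content_cluster):
    """Return one threshold per (user cluster, content cluster) pair.

    ASSUMPTION: the threshold is the median streaming time observed in the
    co-cluster. The abstract only says the within-co-cluster distribution
    defines the reward; the median is an illustrative choice.
    """
    times = defaultdict(list)
    for user, item, seconds in events:
        key = (user_cluster[user], content_cluster[item])
        times[key].append(seconds)
    return {key: median(values) for key, values in times.items()}


def cocluster_reward(user, item, stream_seconds, thresholds,
                     user_cluster, content_cluster):
    """Binarized reward using the threshold of the matching co-cluster.

    Falls back to the static 30-second threshold for unseen co-clusters.
    """
    key = (user_cluster[user], content_cluster[item])
    threshold = thresholds.get(key, STATIC_THRESHOLD_S)
    return 1 if stream_seconds >= threshold else 0


# Hypothetical cluster assignments, assumed to come from a co-clustering step.
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
content_cluster = {"podcast": 0, "playlist": 1}

# Hypothetical (user, content, streaming seconds) events.
events = [
    ("u1", "podcast", 600.0),
    ("u2", "podcast", 900.0),
    ("u1", "podcast", 1200.0),
    ("u3", "playlist", 20.0),
    ("u3", "playlist", 40.0),
]

thresholds = fit_cocluster_thresholds(events, user_cluster, content_cluster)

# A 45-second stream counts as success under the static rule, but not in a
# co-cluster where streams typically last many minutes.
print(static_reward(45.0))
print(cocluster_reward("u1", "podcast", 45.0, thresholds,
                       user_cluster, content_cluster))
print(cocluster_reward("u3", "playlist", 45.0, thresholds,
                       user_cluster, content_cluster))
```

In practice the row and column labels could come from any co-clustering method applied to the user-by-content co-occurrence matrix, for example scikit-learn's `SpectralCoclustering`. That choice is likewise an assumption here, not something the abstract specifies.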

Mounia Lalmas | Rishabh Mehrotra | Paolo Dragone
