Deriving User- and Content-specific Rewards for Contextual Bandits

Bandit algorithms have gained increased attention in recommender systems, as they provide effective and scalable recommendations. These algorithms optimize a reward function, usually based on a numeric variable such as click-through rate. On a popular music streaming service, a contextual bandit algorithm decides which content to recommend to users. Its reward function binarizes a numeric variable, defining success by a static threshold on user streaming time: 1 if the user streamed for at least 30 seconds and 0 otherwise. We explore alternative methods that provide a more informed reward function, based on the assumption that the distribution of streaming time depends heavily on the type of user and the type of content being streamed. To automatically extract user and content groups from streaming data, we employ "co-clustering", an unsupervised learning technique that simultaneously extracts clusters of rows and columns from a co-occurrence matrix. The streaming distributions within the co-clusters are then used to define rewards specific to each co-cluster. Our proposed co-clustering-based reward functions lead to an improvement of over 25% in expected stream rate, compared to the standard binarized rewards.
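To make the contrast between the static reward and a co-cluster-specific reward concrete, the following is a minimal sketch in Python. It assumes the user and content cluster assignments are already available (for example, from a co-clustering step that is not shown here), and it assumes the per-co-cluster threshold is the median streaming time observed in that co-cluster. The abstract does not say which statistic of the streaming distribution is used, so the median, the function names, and the toy data are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict
from statistics import median

# Baseline from the abstract: success means streaming for at least 30 seconds.
STATIC_THRESHOLD_S = 30.0


def static_reward(stream_seconds):
    """Standard binarized reward with one global threshold."""
    return 1 if stream_seconds >= STATIC_THRESHOLD_S else 0


def fit_cocluster_thresholds(events, user_cluster, content_cluster):
    """Compute one threshold per (user cluster, content cluster) pair.

    events: iterable of (user_id, content_id, stream_seconds).
    The median is an assumed choice of statistic, not taken from the paper.
    """
    times = defaultdict(list)
    for user_id, content_id, secs in events:
        key = (user_cluster[user_id], content_cluster[content_id])
        times[key].append(secs)
    return {key: median(vals) for key, vals in times.items()}


def cocluster_reward(stream_seconds, user_id, content_id,
                     user_cluster, content_cluster, thresholds):
    """Binarized reward whose threshold depends on the co-cluster.

    Falls back to the static threshold for co-clusters with no data.
    """
    key = (user_cluster[user_id], content_cluster[content_id])
    threshold = thresholds.get(key, STATIC_THRESHOLD_S)
    return 1 if stream_seconds >= threshold else 0


# Hypothetical cluster assignments and streaming logs (toy data).
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
content_cluster = {"sleep_playlist": 0, "pop_hits": 1}
events = [
    ("u1", "sleep_playlist", 1200.0),
    ("u2", "sleep_playlist", 1800.0),
    ("u1", "sleep_playlist", 600.0),
    ("u3", "pop_hits", 20.0),
    ("u3", "pop_hits", 40.0),
    ("u3", "pop_hits", 10.0),
]

thresholds = fit_cocluster_thresholds(events, user_cluster, content_cluster)
```

In this toy setup, a 45-second stream of the long-form playlist counts as a success under the static reward but not under the co-cluster reward, while a 25-second stream of short-form content does the opposite. This is the kind of user- and content-dependent distinction the abstract argues a single 30-second threshold cannot capture.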
