Collective Noise Contrastive Estimation for Policy Transfer Learning

We address the problem of learning behaviour policies that optimise online metrics from heterogeneous usage data. Online metrics such as click-through rate can be optimised effectively using exploration data, but such data is costly to collect in practice because it temporarily degrades the user experience. Leveraging related data sources to improve online performance would be extremely valuable, but is not possible with current approaches. We formulate this task as a policy transfer learning problem and propose a first solution, called collective noise contrastive estimation (collective NCE). NCE is an efficient method for approximating the gradient of a log-softmax objective. Our approach jointly optimises embeddings of heterogeneous data to transfer knowledge from the source domain to the target domain. We demonstrate its effectiveness by learning a policy for an online radio station jointly from user-generated playlists and usage data collected in an exploration bucket.
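To make the role of NCE concrete, the following is a minimal sketch of the standard (single-domain) NCE loss for one training example, not the collective variant proposed here. It assumes a dot-product score between a context vector and item embeddings and a fixed noise distribution; the names `nce_loss`, `context_vec`, `item_embeddings`, and `noise_probs` are hypothetical and chosen for illustration only. The point is that only the target and the `k` sampled noise items are scored, so the normaliser over all items is never computed.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def nce_loss(context_vec, item_embeddings, target, noise_ids, noise_probs):
    """Standard NCE loss for one (context, target) pair.

    Assumption: the unnormalised log-score of an item is the dot
    product of its embedding with the context vector.
    """
    noise_ids = np.asarray(noise_ids)
    k = len(noise_ids)
    # Score only the target and the k noise samples, not the whole catalogue.
    ids = np.concatenate(([target], noise_ids))
    scores = item_embeddings[ids] @ context_vec
    # Logit of "this item came from the data rather than the noise".
    logits = scores - np.log(k * noise_probs[ids])
    # The target should be classified as data, the noise samples as noise.
    return -log_sigmoid(logits[0]) - np.sum(log_sigmoid(-logits[1:]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_items, dim, k = 50, 8, 5
    emb = rng.normal(size=(n_items, dim))
    ctx = rng.normal(size=dim)
    q = np.full(n_items, 1.0 / n_items)   # uniform noise distribution (assumed)
    noise = rng.choice(n_items, size=k, p=q)
    print(nce_loss(ctx, emb, target=3, noise_ids=noise, noise_probs=q))
```

The cost per example is proportional to `k` rather than to the number of items, which is what makes the approximation attractive for large item catalogues.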
