Optimal recommendation to users that react: Online learning for a class of POMDPs

We describe and study a model for an Automated Online Recommendation System (AORS) in which a user's preferences can be time-dependent and can also depend on the history of past recommendations and play-outs. Three key features make the model more realistic than existing models for recommendation systems: (1) user preference is inherently latent, (2) current recommendations can affect future preferences, and (3) it allows for the development of learning algorithms with provable performance guarantees. The problem is cast as an average-cost restless multi-armed bandit for a given user, with an independent partially observable Markov decision process (POMDP) for each item of content. We analyze the POMDP for a single arm, describe its structural properties, and characterize its optimal policy. We then develop a Thompson sampling-based online reinforcement learning algorithm that learns the parameters of the model and optimizes utility from the binary responses of users to successive recommendations. We analyze the performance of the learning algorithm and characterize its regret. Illustrative numerical results and directions for extension to the restless hidden Markov multi-armed bandit problem are also presented.
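
To fix ideas, the sketch below illustrates the kind of Thompson-sampling loop the abstract describes for a restless bandit whose arms are hidden Markov chains observed only through binary accept/reject responses. It is a simplified illustration, not the paper's algorithm: the two-state chains, the assumption that the accept probabilities q are known, the Beta priors on transition probabilities, the soft-count posterior updates, and the myopic one-step action rule are all assumptions introduced here for brevity.

# Illustrative sketch (not the paper's construction): Thompson sampling for a
# restless bandit with two-state hidden Markov arms and binary observations.
import numpy as np

rng = np.random.default_rng(0)

N_ARMS = 3          # items of content
HORIZON = 5000      # recommendation rounds

# Hidden truth, unknown to the learner: each arm's interest state flips with
# probability p01 (from 0 to 1) or p10 (from 1 to 0), and a recommendation is
# accepted (reward 1) with probability q[state]. q is assumed known here.
true_p = rng.uniform(0.05, 0.3, size=(N_ARMS, 2))   # columns: [p01, p10]
q = np.array([0.2, 0.8])                             # accept prob. per state
state = rng.integers(0, 2, size=N_ARMS)

# Learner's Beta posteriors over each arm's transition probabilities and a
# belief (probability that the arm is in state 1) for each arm.
post = np.ones((N_ARMS, 2, 2))   # [arm, p01 or p10, (alpha, beta)]
belief = np.full(N_ARMS, 0.5)

total_reward = 0
for t in range(HORIZON):
    # Thompson sampling: draw one plausible model per arm, then act
    # myopically on the one-step expected reward under the sampled models.
    sampled_p = rng.beta(post[:, :, 0], post[:, :, 1])          # (N_ARMS, 2)
    next_belief = belief * (1 - sampled_p[:, 1]) + (1 - belief) * sampled_p[:, 0]
    expected_reward = next_belief * q[1] + (1 - next_belief) * q[0]
    arm = int(np.argmax(expected_reward))

    # Environment: every arm's hidden state evolves (restlessness), but only
    # the recommended arm produces a binary response.
    flip = rng.random(N_ARMS) < true_p[np.arange(N_ARMS), state]
    state = np.where(flip, 1 - state, state)
    reward = int(rng.random() < q[state[arm]])
    total_reward += reward

    # Propagate all beliefs one step under the sampled model, then apply a
    # Bayes update for the played arm using its binary response.
    old_b = belief[arm]
    belief = next_belief
    like1 = q[1] if reward else 1.0 - q[1]
    like0 = q[0] if reward else 1.0 - q[0]
    b = belief[arm]
    belief[arm] = like1 * b / (like1 * b + like0 * (1 - b))

    # Heuristic soft-count update of the played arm's transition posteriors,
    # standing in for the paper's posterior computation.
    new_b = belief[arm]
    post[arm, 0, 0] += (1 - old_b) * new_b         # pseudo 0 -> 1
    post[arm, 0, 1] += (1 - old_b) * (1 - new_b)   # pseudo 0 -> 0
    post[arm, 1, 0] += old_b * (1 - new_b)         # pseudo 1 -> 0
    post[arm, 1, 1] += old_b * new_b               # pseudo 1 -> 1

print(f"average reward over {HORIZON} rounds: {total_reward / HORIZON:.3f}")

The myopic one-step rule above is only a placeholder for exposition; the paper's setting calls for a policy optimized for average cost over the restless arms, with Thompson sampling driving exploration over the unknown model parameters.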
