Capacity-aware Sequential Recommendations

Personalized recommendations are increasingly important to engage users and to guide them through large systems, for example when recommending points of interest to tourists visiting a popular city. To maximize long-term user experience, the system should consider issuing recommendations sequentially: by observing the user's response to each recommendation, it can update its estimate of the user's latent interests. However, because traditional recommender systems target individuals, their combined effect on a collective of users can unintentionally overload the capacity of the recommended resources. Recommender systems should therefore consider not only the users' interests, but also the effect of their recommendations on the available capacity. We exploit the structure of this constrained, multi-agent, partially observable decision problem with a novel belief-space sampling algorithm that bounds the size of the state space through a limit on regret. By exploiting the stationary structure of the problem, our algorithm is significantly more scalable than existing approximate solvers. Moreover, by explicitly considering the information value of actions, it yields significantly better recommendations than an extension of posterior sampling reinforcement learning to the constrained multi-agent case. Finally, we show how to decouple constraint satisfaction from the sequential recommendation policies, resulting in algorithms that issue recommendations to thousands of agents while respecting constraints.
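
To make the interplay between sequential belief updates, posterior sampling, and capacity constraints concrete, below is a minimal Python sketch under simplifying assumptions: each user's interest in each item is modeled as a latent Bernoulli acceptance probability with a Beta prior, and capacity is a hard per-round limit on how many users may be directed to each item. The class and method names (CapacityAwareRecommender, recommend_round, update) are illustrative and not taken from the paper.

```python
import numpy as np

class CapacityAwareRecommender:
    """Illustrative sketch only (not the paper's algorithm): posterior-sampling
    recommendations under a hard per-item capacity, with Beta-Bernoulli beliefs
    over each user's latent interest. All names here are hypothetical."""

    def __init__(self, n_users, n_items, capacity, seed=0):
        self.rng = np.random.default_rng(seed)
        # Beta(alpha, beta) posterior over each (user, item) acceptance probability.
        self.alpha = np.ones((n_users, n_items))
        self.beta = np.ones((n_users, n_items))
        self.capacity = np.asarray(capacity)  # recommendations allowed per item per round

    def recommend_round(self):
        # Thompson step: sample interests from the current posterior.
        theta = self.rng.beta(self.alpha, self.beta)
        remaining = self.capacity.copy()
        assignment = np.full(theta.shape[0], -1)  # -1 = no recommendation issued
        # Greedy assignment: visit (user, item) pairs from most to least promising,
        # issuing a recommendation only while the item still has capacity left.
        order = np.column_stack(
            np.unravel_index(np.argsort(-theta, axis=None), theta.shape))
        for user, item in order:
            if assignment[user] == -1 and remaining[item] > 0:
                assignment[user] = item
                remaining[item] -= 1
        return assignment

    def update(self, assignment, accepted):
        # Sequential aspect: each observed response refines that user's belief.
        for user, item in enumerate(assignment):
            if item >= 0:
                if accepted[user]:
                    self.alpha[user, item] += 1
                else:
                    self.beta[user, item] += 1
```

In the paper's setting, decoupling constraint satisfaction from the per-user recommendation policies is what enables scaling to thousands of agents; in this toy sketch that decoupling is reduced to a single greedy pass over the sampled interests, which respects the capacity limits but makes no claim about regret or about the value of information.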
