Combinatorial Bandits for Incentivizing Agents with Dynamic Preferences

The design of personalized incentives and recommendations to improve user engagement is gaining prominence as digital platforms continue to proliferate. We propose a multi-armed bandit framework for matching incentives to users whose preferences are unknown a priori and evolve dynamically over time, in a resource-constrained environment. We design an algorithm that combines ideas from three distinct domains: (i) a greedy matching paradigm, (ii) the upper confidence bound (UCB) algorithm for bandits, and (iii) mixing times from the theory of Markov chains. For this algorithm we provide theoretical bounds on the regret and demonstrate its performance on both synthetic examples and a realistic one: matching supply and demand in a bike-sharing platform.
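To make the combination concrete, here is a minimal sketch of the matching-plus-UCB idea in Python. It maintains a UCB index for every (user, incentive) pair and, each round, builds a greedy matching under a one-incentive-per-user, one-user-per-incentive resource constraint. Everything in it is an illustrative assumption rather than the paper's implementation: the problem sizes, the Bernoulli rewards, and in particular the i.i.d. reward model, which omits the Markovian preference dynamics and mixing-time machinery the paper actually analyzes.

```python
import numpy as np

# Illustrative sketch only: i.i.d. Bernoulli rewards stand in for the
# paper's Markovian user preferences, and all sizes are made up.
rng = np.random.default_rng(0)
n_users, n_incentives, horizon = 5, 5, 2000

# Unknown mean reward of assigning incentive j to user i (hidden from the learner).
true_means = rng.uniform(size=(n_users, n_incentives))

counts = np.zeros((n_users, n_incentives))  # plays of each (user, incentive) pair
means = np.zeros((n_users, n_incentives))   # empirical mean reward of each pair

for t in range(1, horizon + 1):
    # UCB index per pair; unplayed pairs get +inf to force exploration.
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
    ucb = np.where(counts > 0, means + bonus, np.inf)

    # Greedy matching: scan pairs in decreasing index order, keeping a pair
    # whenever both its user and its incentive are still unassigned.
    free_users = set(range(n_users))
    free_incentives = set(range(n_incentives))
    order = np.column_stack(np.unravel_index(np.argsort(-ucb, axis=None), ucb.shape))
    matching = []
    for i, j in order:
        if i in free_users and j in free_incentives:
            matching.append((i, j))
            free_users.discard(i)
            free_incentives.discard(j)

    # Play the matching and update the empirical means incrementally.
    for i, j in matching:
        r = rng.binomial(1, true_means[i, j])
        counts[i, j] += 1
        means[i, j] += (r - means[i, j]) / counts[i, j]
```

Greedy matching is used here, in line with the greedy paradigm the abstract names, since it needs no solver and is a standard approximation for maximum-weight matching; the paper's full algorithm additionally decides how long to wait for the users' preference chains to mix before trusting the observed rewards, an epoch structure this sketch does not model.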
