The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation

We study the behavior of stochastic bandit algorithms under \emph{strategic behavior} by rational actors, namely the arms themselves. Each arm is a strategic player that can modify its own reward whenever it is pulled, subject to a cross-period budget constraint. Each arm is \emph{self-interested} and seeks to maximize the expected number of times it is pulled over the decision horizon. Such strategic manipulation arises naturally in economic applications, e.g., recommendation systems such as Yelp and Amazon. We analyze the robustness of three popular bandit algorithms: UCB, $\varepsilon$-Greedy, and Thompson Sampling. We prove that all three algorithms achieve a regret upper bound of $\mathcal{O}(\max\{B, \ln T\})$ under \emph{any} (possibly adaptive) strategy of the strategic arms, where $B$ is the total budget across arms and $T$ is the decision horizon. Moreover, we prove that this regret upper bound is \emph{tight}. Our results illustrate the intrinsic robustness of bandit algorithms against strategic manipulation so long as $B = o(T)$. This is in sharp contrast to the more pessimistic model of adversarial attacks, where an attack budget of $\mathcal{O}(\ln T)$ can trick UCB and $\varepsilon$-Greedy into pulling the optimal arm only $o(T)$ times. Our results hold for both bounded and unbounded rewards.
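To make the setting concrete, the following is a minimal simulation sketch, not the paper's construction or experiments. It runs the standard UCB1 index rule on Bernoulli arms, and each arm follows a simple hypothetical strategy: whenever it is pulled, it adds up to `boost` to its realized reward until its budget runs out. The function name `simulate_ucb`, the parameter `boost`, and the example means and budgets are all assumptions introduced here for illustration.

```python
import math
import random


def simulate_ucb(means, budgets, horizon, boost=1.0, seed=0):
    """Run UCB1 against arms that inflate their own rewards.

    Illustrative assumptions (not taken from the paper):
      * true rewards are Bernoulli with the given means;
      * each pulled arm adds min(boost, remaining budget) to its reward.
    Returns (pull counts per arm, budget spent per arm).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    totals = [0.0] * k         # sum of observed (manipulated) rewards
    remaining = list(budgets)  # budget each arm still has available

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1        # pull every arm once to initialize
        else:
            # UCB1 index: empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )

        true_reward = 1.0 if rng.random() < means[arm] else 0.0
        spend = min(boost, remaining[arm])  # never exceed the budget
        remaining[arm] -= spend

        pulls[arm] += 1
        totals[arm] += true_reward + spend

    spent = [b - r for b, r in zip(budgets, remaining)]
    return pulls, spent


if __name__ == "__main__":
    # Arm 0 is optimal and honest; arm 1 is suboptimal with budget B = 20.
    pulls, spent = simulate_ucb([0.9, 0.5], [0.0, 20.0], horizon=5000)
    print("pulls per arm:", pulls)
    print("budget spent per arm:", spent)
```

Varying the budget of the suboptimal arm relative to `horizon` gives an informal feel for how the number of suboptimal pulls grows with $B$; a single seeded run like this is only an illustration and does not verify the stated bound.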
