Improving multi-armed bandit algorithms in online pricing settings

Abstract

The design of effective bandit algorithms to learn the optimal price is a task of extraordinary importance in all the settings in which the demand curve is not known a priori and the estimation process takes a long time, as is customary, e.g., in e-commerce scenarios. In particular, the adoption of effective pricing algorithms may allow companies to increase their profits dramatically. In this paper, we exploit the structure of the pricing problem in online scenarios to improve the performance of state-of-the-art general-purpose bandit algorithms. More specifically, we make use of the monotonicity of the customer demand curve, which implies the same monotonic behavior of the conversion rates, and we exploit the fact that, in many scenarios, companies have a priori information about the order of magnitude of the conversion rate. We design techniques, applicable in principle to any bandit algorithm, capable of exploiting these two properties, and we apply them to Upper Confidence Bound policies in both stationary and nonstationary environments. We show that algorithms exploiting these two properties may significantly outperform state-of-the-art bandit policies in most configurations, and that the improvement grows as the number of arms increases. In particular, simulations based on real-world data show that our algorithms may increase the profit by 300% or more compared to state-of-the-art bandit algorithms. Furthermore, we formally prove that this empirical improvement comes at no cost in terms of theoretical guarantees: our algorithms retain the same asymptotic worst-case regret bounds as the state-of-the-art bandit algorithms they build on.
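To make the two structural properties concrete, the sketch below shows how a UCB1-style pricing policy could incorporate them. It is a minimal illustration under stated assumptions, not the paper's exact algorithm: it assumes candidate prices sorted in increasing order, Bernoulli conversion feedback, and a known a priori upper bound (here called max_rate) on the conversion rate; the function name and signature are hypothetical.

import math
import numpy as np

def choose_price_index(prices, successes, pulls, t, max_rate=1.0):
    # prices:    candidate prices, sorted in increasing order
    # successes: conversions observed at each price
    # pulls:     number of times each price was posted
    # t:         current round (t >= 1)
    # max_rate:  assumed a priori upper bound on the conversion rate
    prices = np.asarray(prices, dtype=float)
    successes = np.asarray(successes, dtype=float)
    pulls = np.asarray(pulls, dtype=float)

    # Post each price once before relying on confidence bounds.
    untried = np.flatnonzero(pulls == 0)
    if untried.size > 0:
        return int(untried[0])

    # Standard UCB1 upper confidence bounds on the conversion rates.
    ucb = successes / pulls + np.sqrt(2.0 * math.log(t) / pulls)

    # Property 1: the conversion rate cannot exceed the a priori bound.
    ucb = np.minimum(ucb, max_rate)

    # Property 2: the demand curve is nonincreasing in price, so the
    # bound at any price is also capped by the bounds at lower prices.
    ucb = np.minimum.accumulate(ucb)

    # Optimistic expected profit: posted price times conversion bound.
    return int(np.argmax(prices * ucb))

Both clipping steps can only tighten the confidence bounds, which is consistent with the claim above that the empirical improvement comes without weakening the worst-case regret guarantees of the underlying policy; the same clipping could in principle be applied to nonstationary UCB variants.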
