Multi-Armed Bandit for Pricing

This paper studies Multi-Armed Bandit (MAB) approaches for pricing applications, where a seller needs to identify the selling price for a particular kind of item that maximizes their profit without knowing the buyer demand. We propose modifications to the popular Upper Confidence Bound (UCB) bandit algorithm that exploit two peculiarities of pricing applications: 1) as the selling price increases, it is rational to assume that the probability of the item being sold decreases; 2) since people usually compare prices from different sellers and track price changes over time before buying (especially for online purchases), the number of times that a certain kind of item is purchased is only a small fraction of the number of times that its price is viewed by potential buyers. Leveraging these assumptions, we consider refinements of the concentration inequality used in the UCB1 algorithm that are significantly tighter than the original one, especially in the early learning stages when only a few samples are available. We provide empirical evidence of the effectiveness of the proposed variations in speeding up the learning process of UCB1 in pricing applications.
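
For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal sketch of standard UCB1 with its Hoeffding-based exploration bonus, applied to a finite set of candidate prices; it is not the modified algorithm proposed in the paper. The function name `ucb1_pricing`, the price grid, the purchase probabilities, and the normalization of rewards by the largest price are all assumptions made for illustration.

```python
import math
import random


def ucb1_pricing(prices, buy_prob, horizon, seed=0):
    """Run standard UCB1 over a finite set of candidate prices.

    prices   -- candidate selling prices (the arms)
    buy_prob -- hypothetical purchase probability for each price;
                the learner never reads it directly, it only drives
                the simulated buyer
    horizon  -- number of rounds (one potential buyer per round)

    Rewards are scaled to [0, 1] by dividing by the largest price,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    max_price = max(prices)
    n_arms = len(prices)
    pulls = [0] * n_arms          # times each price was offered
    mean_reward = [0.0] * n_arms  # empirical mean normalized revenue

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: offer every price once.
            arm = t - 1
        else:
            # UCB1 index: empirical mean plus Hoeffding-based bonus.
            arm = max(
                range(n_arms),
                key=lambda i: mean_reward[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )

        # Simulated buyer: purchases with the (hidden) probability.
        sold = rng.random() < buy_prob[arm]
        reward = prices[arm] / max_price if sold else 0.0

        # Incremental update of the empirical mean.
        pulls[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]

    return pulls, mean_reward


if __name__ == "__main__":
    # Purchase probability decreases as the price increases,
    # in line with the first assumption stated in the abstract.
    pulls, means = ucb1_pricing(
        prices=[1.0, 2.0, 3.0, 4.0],
        buy_prob=[0.9, 0.8, 0.3, 0.1],
        horizon=20000,
    )
    print(pulls)
    print(means)
```

The bonus term `sqrt(2 ln t / n_i)` is the part of UCB1 that the paper proposes to tighten; in this sketch it ignores both the monotonicity of demand in the price and the low purchase rate.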
