Dynamic Pricing under Finite Space Demand Uncertainty: A Multi-Armed Bandit with Dependent Arms

We consider a dynamic pricing problem under an unknown demand model. A seller offers prices to a stream of customers and observes either success or failure in each sale attempt. The underlying demand model is unknown to the seller and can take one of N possible forms. We show that this problem can be formulated as a multi-armed bandit with dependent arms, and we propose a dynamic pricing policy based on the likelihood ratio test. The proposed policy achieves complete learning, i.e., it incurs bounded regret, where regret is defined as the revenue loss relative to the case of a known demand model. This is in sharp contrast with the logarithmically growing regret in multi-armed bandits with independent arms.
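To make the setting concrete, the following is a minimal sketch of a likelihood-based pricing loop over a finite set of demand hypotheses. It is an illustration under stated assumptions, not the policy analyzed in the paper: the price grid `PRICES`, the candidate models `MODELS`, and the greedy rule of offering the price that is optimal under the currently most likely hypothesis are all hypothetical choices made here. The sketch also assumes that every price yields a different sale probability under every hypothesis, so each observation is informative about all of them (the sense in which the arms are dependent).

```python
import math
import random

# Hypothetical price grid and candidate demand models (illustrative values only).
PRICES = [1.0, 2.0, 3.0]
# MODELS[k][j] = probability of a sale at PRICES[j] under hypothesis k.
# Assumption: at every price, all hypotheses give distinct sale probabilities.
MODELS = [
    [0.80, 0.30, 0.10],
    [0.90, 0.70, 0.20],
    [0.95, 0.85, 0.60],
]


def best_price_index(model):
    """Index of the price maximizing expected revenue under one demand model."""
    return max(range(len(PRICES)), key=lambda j: PRICES[j] * model[j])


def update_log_likelihoods(loglik, price_idx, sold):
    """Add the log-probability of the observed outcome under each hypothesis."""
    return [
        ll + math.log(m[price_idx] if sold else 1.0 - m[price_idx])
        for ll, m in zip(loglik, MODELS)
    ]


def most_likely(loglik):
    """Index of the hypothesis with the largest log-likelihood (ties: lowest index)."""
    return max(range(len(loglik)), key=lambda k: loglik[k])


def run(true_model, horizon, seed=0):
    """Simulate the greedy likelihood-based policy.

    Returns the final most likely hypothesis and the cumulative expected
    revenue loss relative to always offering the truly optimal price.
    """
    rng = random.Random(seed)
    loglik = [0.0] * len(MODELS)
    truth = MODELS[true_model]
    best_rev = max(p * q for p, q in zip(PRICES, truth))
    regret = 0.0
    for _ in range(horizon):
        j = best_price_index(MODELS[most_likely(loglik)])
        sold = rng.random() < truth[j]          # success or failure of the sale
        loglik = update_log_likelihoods(loglik, j, sold)
        regret += best_rev - PRICES[j] * truth[j]
    return most_likely(loglik), regret


if __name__ == "__main__":
    k_hat, regret = run(true_model=1, horizon=2000)
    print("most likely model:", k_hat, "cumulative expected regret:", round(regret, 3))
```

Because a single sale outcome updates the likelihood of every hypothesis at once, the seller does not need to try each price separately to learn about it, which is the intuition behind the contrast with independent-arm bandits drawn in the abstract. If two hypotheses agreed at the offered price, this greedy rule could stop learning, which is why the sketch assumes they never do.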
