A deep reinforcement learning framework for allocating buyer impressions in e-commerce websites

We study the problem of allocating impressions to sellers on e-commerce websites such as Amazon, eBay or Taobao, with the aim of maximizing the total revenue generated by the platform. When a buyer searches for a keyword, the website presents the buyer with a list of sellers offering the item, together with the corresponding prices. This can be seen as an instance of a resource allocation problem in which the sellers choose their prices at each step and the platform decides how to allocate the impressions, based on the chosen prices and the historical transactions of each seller. Due to the complexity of the system, most e-commerce platforms employ heuristic allocation algorithms that depend mainly on the sellers' transaction records and do not take the rationality of the sellers into account, which makes them susceptible to price manipulation. In this paper, we put forward a general framework for designing impression allocation algorithms in e-commerce websites given any behavioural model for the sellers, using deep reinforcement learning. The impression allocation problem is modeled as a Markov decision process, where the states encode the history of impressions, prices, transactions and generated revenue, and the actions are the possible impression allocations at each round. To tackle the continuity and high dimensionality of the states and actions, we adopt the ideas of the DDPG algorithm to design an actor-critic policy gradient algorithm that takes advantage of the problem domain in order to achieve convergence and stability. We compare our algorithm against natural heuristics, and it outperforms all of them in terms of the total revenue generated. Finally, in contrast to the DDPG algorithm, our algorithm is robust to settings with variable sellers and converges easily.
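To make the Markov decision process formulation concrete, the following is a minimal sketch of a toy environment, not the implementation used in the paper. The class name `ImpressionAllocationEnv`, the helper `fixed_price_sellers`, the fixed-length history window, and the purchase rule (each impression converts with probability 1 / (1 + price)) are all illustrative assumptions. The state is the flattened recent history of impressions, prices, transactions and revenue; the action is a vector of impression shares; the reward is the revenue of the round.

```python
from collections import deque


def fixed_price_sellers(history, n_sellers):
    """Hypothetical behavioural model: seller i always posts price 2 * (i + 1)."""
    return [2.0 * (i + 1) for i in range(n_sellers)]


class ImpressionAllocationEnv:
    """Toy MDP sketch. State: recent history. Action: impression shares."""

    def __init__(self, n_sellers=2, total_impressions=100.0, history_len=2,
                 seller_model=fixed_price_sellers):
        self.n_sellers = n_sellers
        self.total_impressions = total_impressions
        self.seller_model = seller_model  # any behavioural model can be plugged in
        self.history = deque(maxlen=history_len)
        self.reset()

    def reset(self):
        """Fill the history window with empty rounds and return the state."""
        self.history.clear()
        zeros = [0.0] * self.n_sellers
        for _ in range(self.history.maxlen):
            self.history.append((zeros, zeros, zeros, 0.0))
        return self.state()

    def state(self):
        """Flatten (impressions, prices, transactions, revenue) per round."""
        flat = []
        for impressions, prices, transactions, revenue in self.history:
            flat += impressions + prices + transactions + [revenue]
        return flat

    def step(self, shares):
        """Apply an allocation and return (next_state, reward)."""
        total_share = sum(shares)
        if total_share <= 0 or min(shares) < 0:
            raise ValueError("shares must be non-negative and not all zero")
        shares = [s / total_share for s in shares]  # project onto the simplex
        prices = self.seller_model(list(self.history), self.n_sellers)
        impressions = [self.total_impressions * s for s in shares]
        # Assumed demand rule: expected purchases fall as the price rises.
        transactions = [imp / (1.0 + p) for imp, p in zip(impressions, prices)]
        revenue = sum(p * t for p, t in zip(prices, transactions))
        self.history.append((impressions, prices, transactions, revenue))
        return self.state(), revenue


env = ImpressionAllocationEnv()
state, reward = env.step([0.5, 0.5])  # split impressions evenly between two sellers
```

An actor network in this setting would map `state` to a share vector and a critic would estimate the long-run revenue of a state-action pair; both are omitted here, since the sketch only illustrates the state, action and reward interface.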
