InBEDE: Integrating Contextual Bandit with TD Learning for Joint Pricing and Dispatch of Ride-Hailing Platforms

For both the traditional street-hailing taxi industry and the more recent online ride-hailing services, improving marketplace efficiency has been a major challenge, owing in part to the spatio-temporal imbalance between supply and demand. Numerous approaches improve marketplace efficiency through pricing and dispatch strategies, but they usually optimize pricing and dispatch separately. In this paper, we show that these two processes are in fact intrinsically interrelated. Motivated by this observation, we attempt to optimize pricing and dispatch strategies simultaneously. Such joint optimization is extremely challenging, however, because of the problem's inherently huge scale and the lack of a uniform model. To handle this complexity, we propose InBEDE (Integrating contextual Bandit with tEmporal DiffErence learning), a learning framework in which pricing strategies are learned via a contextual bandit algorithm and dispatch strategies are optimized with the help of temporal difference learning. The two learning components proceed in a mutual bootstrapping manner, in the sense that their policy evaluations are inter-dependent. Evaluating InBEDE on real-world datasets from two Chinese cities, provided by Didi Chuxing, an online ride-hailing platform, we show that it can significantly improve the platform's market efficiency.
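
The mutual bootstrapping between the two components can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the bandit algorithm, the state representation, or the reward, so everything below is an assumption made for illustration. An epsilon-greedy linear bandit stands in for the pricing learner, a tabular TD(0) update stands in for the dispatch value learner, and the names `EpsGreedyLinearPricer`, `td0_update`, `joint_step`, and `env_step` are hypothetical. The coupling is that the bandit is trained on a target that includes the TD value of the next state, while the TD value is updated from the reward produced under the bandit's chosen price.

```python
import random


class EpsGreedyLinearPricer:
    """Hypothetical contextual bandit: one linear payoff model per price level."""

    def __init__(self, n_prices, dim, epsilon=0.1, lr=0.1, seed=0):
        self.theta = [[0.0] * dim for _ in range(n_prices)]
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def predict(self, arm, x):
        # Estimated payoff of price level `arm` in context `x`.
        return sum(w * xi for w, xi in zip(self.theta[arm], x))

    def select(self, x):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.theta))
        return max(range(len(self.theta)), key=lambda a: self.predict(a, x))

    def update(self, arm, x, target):
        # One stochastic-gradient step toward the observed target.
        err = target - self.predict(arm, x)
        self.theta[arm] = [w + self.lr * err * xi
                           for w, xi in zip(self.theta[arm], x)]


def td0_update(V, s, r, s_next, lr=0.1, gamma=0.9):
    """Tabular TD(0): V[s] <- V[s] + lr * (r + gamma * V[s_next] - V[s])."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += lr * td_error
    return td_error


def joint_step(pricer, V, state, context, env_step, gamma=0.9, td_lr=0.1):
    """One coupled update. `env_step(state, arm)` is an assumed simulator
    returning (immediate_reward, next_state)."""
    arm = pricer.select(context)
    reward, next_state = env_step(state, arm)
    # The bandit's target bootstraps from the TD value of the next state ...
    pricer.update(arm, context, reward + gamma * V[next_state])
    # ... and the TD value is learned from rewards under the bandit's price.
    td0_update(V, state, reward, next_state, lr=td_lr, gamma=gamma)
    return arm
```

In this sketch a state would be something like a discretized spatio-temporal cell and a context a feature vector describing a trip request; both choices are assumptions, as is the use of a discrete set of price levels.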
