Stateful Offline Contextual Policy Evaluation and Learning

We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals, each carrying a context that generates an unknown individual-level response to the agent's actions. This model can be viewed as an offline generalization of contextual bandits with resource constraints. We formalize the relevant causal structure of problems such as dynamic personalized pricing and other operations management settings with potentially high-dimensional user types. The key insight is that the individual-level response is often not causally affected by the state variable and can therefore be readily generalized across timesteps and states. When this holds, we study the implications for (doubly robust) off-policy evaluation and learning: single-time-step evaluation, which estimates the expectation over a single arrival from population data, can be leveraged within fitted-value iteration on a marginal MDP. We analyze sample complexity and characterize an error amplification that leads to the persistence, rather than attenuation, of confounding error over time. In simulations of dynamic and capacitated pricing, we show improved out-of-sample policy performance on this class of problems.
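To make the single-time-step evaluation concrete, the sketch below estimates a target policy's one-step value from simulated logged data with the standard doubly robust estimator (direct outcome model plus an inverse-propensity correction). This is a minimal illustration, not the paper's estimator or the full fitted-value iteration: the data-generating process, the names `mu_hat`, `pi`, and the known logging propensity `e` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data: contexts x, binary actions a drawn from a
# known logging policy with propensity e, and noisy rewards r.
n = 5000
x = rng.uniform(0, 1, n)
e = 0.5 + 0.2 * (x > 0.5)                 # P(a=1 | x) under logging policy
a = (rng.uniform(0, 1, n) < e).astype(int)
mu = lambda x, a: x + 0.5 * a * x         # true mean reward (unknown to learner)
r = mu(x, a) + rng.normal(0, 0.1, n)

# Target policy to evaluate: play action 1 when the context exceeds 0.3.
pi = (x > 0.3).astype(int)

# Outcome model fit by least squares on features (1, x, a, a*x);
# a stand-in for any regression estimator.
X = np.column_stack([np.ones(n), x, a, a * x])
beta, *_ = np.linalg.lstsq(X, r, rcond=None)
mu_hat = lambda x, a: np.column_stack([np.ones_like(x), x, a, a * x]) @ beta

# Doubly robust single-time-step value: direct method plus an
# inverse-propensity-weighted residual on actions that match pi.
prop = np.where(a == 1, e, 1 - e)         # propensity of the logged action
dm = mu_hat(x, pi)
correction = (a == pi) / prop * (r - mu_hat(x, a))
v_dr = float(np.mean(dm + correction))

# Ground-truth value of pi on the same contexts, for comparison.
v_true = float(np.mean(mu(x, pi)))
print(v_dr, v_true)
```

In the stateful setting of the paper, an estimate of this form would replace the per-step expectation over arrivals inside each fitted-value-iteration backup, rather than being used once as here.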
