Sequential Counterfactual Risk Minimization

Counterfactual Risk Minimization (CRM) is a framework for learning from logged bandit feedback, where the goal is to improve upon a logging policy using offline data. In this paper, we explore the setting where it is possible to deploy learned policies multiple times and acquire new data. We extend the CRM principle and its theory to this scenario, which we call "Sequential Counterfactual Risk Minimization" (SCRM). We introduce a novel counterfactual estimator and identify conditions under which multiple deployments improve upon CRM in terms of excess risk and regret rates, using an analysis similar to restart strategies in accelerated optimization methods. We also provide an empirical evaluation of our method in both discrete and continuous action settings, demonstrating the benefits of multiple deployments of CRM.
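For context, the base CRM principle (Swaminathan and Joachims, 2015) selects a policy by minimizing a variance-penalized importance-weighted estimate of the risk. A minimal sketch of that objective follows, with notation chosen here for illustration: logged triples (x_i, a_i, y_i) of contexts, actions, and losses collected under a logging policy pi_0, a policy class Pi, and a penalty weight lambda.

    % Importance-weighted (IPS) risk estimate on n logged samples:
    \hat{R}(\pi) \;=\; \frac{1}{n} \sum_{i=1}^{n} y_i \,
        \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}

    % CRM objective: empirical risk plus a sample-variance penalty,
    % motivated by empirical Bernstein bounds:
    \hat{\pi} \;=\; \operatorname*{arg\,min}_{\pi \in \Pi} \;
        \hat{R}(\pi) \,+\, \lambda \sqrt{\frac{\widehat{\mathrm{Var}}_n(\pi)}{n}}

SCRM, as described above, repeats this step across deployments: each learned policy serves as the logging policy for the next round, and the newly acquired logged data feed the counterfactual estimator used in the following optimization.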
