A Large-scale Open Dataset for Bandit Algorithms

We build and release the Open Bandit Dataset and Pipeline to facilitate scalable and reproducible research on bandit algorithms. They are especially suitable for off-policy evaluation (OPE), which attempts to predict the performance of hypothetical algorithms using data generated by a different algorithm. The dataset is constructed from experiments and implementations on a large-scale fashion e-commerce platform, ZOZOTOWN, and contains ground-truth information about the performance of several bandit policies, enabling fair comparison of different OPE estimators. We also provide the Open Bandit Pipeline, a software package that makes the implementation of OPE easy and consistent. As a proof of concept, we use the dataset and pipeline to implement and evaluate OPE estimators. First, we find that a well-established estimator fails, suggesting that choosing an appropriate estimator is critical. We then select a well-performing estimator and use it to improve the platform's fashion item recommendation; our analysis finds a counterfactual policy that significantly outperforms the historical policies. Our open data and pipeline allow researchers and practitioners to easily evaluate and compare their bandit algorithms and OPE estimators against others in a large, real-world setting.
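To make the OPE setting concrete, the sketch below shows the inverse probability weighting (IPW) estimator, one standard OPE estimator, applied to synthetic logged bandit feedback. This is a minimal plain-NumPy illustration of the idea, not the Open Bandit Pipeline's own API; the function and variable names here are illustrative assumptions.

```python
import numpy as np

def ipw_policy_value(actions, rewards, propensities, eval_action_dist):
    """Estimate the value of an evaluation policy from logged bandit feedback
    via inverse probability weighting (IPW).

    actions:          (n,) actions chosen by the logging (behavior) policy
    rewards:          (n,) observed rewards for those actions
    propensities:     (n,) probability the logging policy assigned to each logged action
    eval_action_dist: (n, n_actions) action-choice probabilities of the evaluation policy
    """
    n = len(actions)
    # Probability that the evaluation policy would have taken the logged action in each round.
    pi_e = eval_action_dist[np.arange(n), actions]
    # Reweight observed rewards by the importance weight pi_e / pi_b and average.
    weights = pi_e / propensities
    return np.mean(weights * rewards)

# Toy usage with synthetic logged data (illustrative only).
rng = np.random.default_rng(0)
n, n_actions = 1000, 3
actions = rng.integers(n_actions, size=n)                 # uniform-random logging policy
rewards = rng.binomial(1, 0.1 + 0.05 * actions)           # action 2 has the highest reward rate
propensities = np.full(n, 1.0 / n_actions)                 # logging propensity of each logged action
eval_action_dist = np.tile(np.eye(n_actions)[2], (n, 1))   # evaluation policy: always choose action 2
print(ipw_policy_value(actions, rewards, propensities, eval_action_dist))  # approx. 0.2
```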
