Paper information - Learning from eXtreme Bandit Feedback

Learning from eXtreme Bandit Feedback

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure, named Policy Optimization for eXtreme Models (POXM), for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
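To make the contrast in the abstract concrete, the following is a minimal sketch of plain importance sampling next to a selective variant restricted to the logging policy's top-p actions. It reflects one reading of the abstract and is not the paper's exact estimator. The single-context setup, the representation of policies as probability lists, the conditional renormalization, and all function names (`top_p_actions`, `renormalize`, `is_estimate`, `selective_is_estimate`) are assumptions made for illustration.

```python
def top_p_actions(probs, p):
    """Indices of the p most probable actions under `probs`."""
    ranked = sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)
    return set(ranked[:p])


def renormalize(probs, subset):
    """Condition a distribution on the event that the action lies in `subset`."""
    mass = sum(probs[a] for a in subset)
    if mass == 0:
        raise ValueError("policy puts no probability mass on the selected actions")
    return {a: probs[a] / mass for a in subset}


def is_estimate(logs, target, logging):
    """Plain importance sampling: reweight every logged reward by target/logging."""
    return sum(r * target[a] / logging[a] for a, r in logs) / len(logs)


def selective_is_estimate(logs, target, logging, p):
    """Assumed selective variant: keep only logs whose action is in the logging
    policy's top-p set, and reweight with probabilities conditioned on that set."""
    subset = top_p_actions(logging, p)
    target_s = renormalize(target, subset)
    logging_s = renormalize(logging, subset)
    kept = [(a, r) for a, r in logs if a in subset]
    if not kept:
        return 0.0
    return sum(r * target_s[a] / logging_s[a] for a, r in kept) / len(kept)


# Toy data (hypothetical): four actions, one context, logs of (action, reward).
logging_policy = [0.5, 0.3, 0.1, 0.1]
target_policy = [0.2, 0.6, 0.1, 0.1]
logs = [(0, 0.0), (1, 1.0), (0, 0.0), (3, 1.0)]

v_is = is_estimate(logs, target_policy, logging_policy)
v_sis = selective_is_estimate(logs, target_policy, logging_policy, p=2)
```

In this sketch the selective variant ignores the reward logged for action 3, which lies outside the top-2 set. Discarding such samples is the source of the bias that the abstract says is traded for lower variance when the action space is very large.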

Michael I. Jordan | Romain Lopez | Inderjit Dhillon

[1] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[2] M. de Rijke, et al. Large-scale Validation of Counterfactual Learning Methods: A Test-Bed, 2016, ArXiv.

[3] M. Sklar. Fonctions de repartition a n dimensions et leurs marges, 1959.

[4] Uri Shalit, et al. Learning Representations for Counterfactual Inference, 2016, ICML.

[5] Yiming Yang, et al. Deep Learning for Extreme Multi-label Text Classification, 2017, SIGIR.

[6] Rahul, et al. A Review of Trends and Techniques in Recommender Systems, 2019, 2019 4th International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU).

[7] Zihan Zhang, et al. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification, 2019, NeurIPS.

[8] Yuan Qi, et al. Cost-Effective Incentive Allocation via Structured Counterfactual Inference, 2019, AAAI.

[9] Ed H. Chi, et al. Top-K Off-Policy Correction for a REINFORCE Recommender System, 2018, WSDM.

[10] Shanfeng Zhu, et al. HAXMLNet: Hierarchical Attention Network for Extreme Multi-Label Text Classification, 2019, ArXiv.

[11] Rohit Babbar, et al. Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification, 2019, ArXiv.

[12] A. Zubiaga. Enhancing Navigation on Wikipedia with Social Tags, 2012, ArXiv.

[13] Yue Wang, et al. Beyond Ranking: Optimizing Whole-Page Presentation, 2016, WSDM.

[14] Pieter Abbeel, et al. Constrained Policy Optimization, 2017, ICML.

[15] John Langford, et al. Off-policy evaluation for slate recommendation, 2016, NIPS.

[16] S. Muthukrishnan, et al. Offline Evaluation of Ranking Policies with Click Models, 2018, KDD.

[17] Claudio Gentile, et al. On multilabel classification and ranking with bandit feedback, 2014, J. Mach. Learn. Res.

[18] May D. Wang, et al. Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization, 2018, ICML.