Master-slave Deep Architecture for Top-K Multi-armed Bandits with Non-linear Bandit Feedback and Diversity Constraints

We propose a novel master-slave architecture for the top-$K$ combinatorial multi-armed bandit problem with non-linear bandit feedback and diversity constraints, which, to the best of our knowledge, is the first combinatorial bandit setting to consider diversity constraints under bandit feedback. Specifically, to explore the combinatorial and constrained action space efficiently, we introduce six slave models with distinct merits that generate diversified samples, balancing reward, constraint satisfaction, and sampling efficiency. Moreover, we propose teaching-learning-based optimization and a policy co-training technique to boost the performance of the multiple slave models. The master model then collects the elite samples provided by the slave models and selects the best one, as estimated by a neural contextual UCB-based network, to make a decision that trades off exploration against exploitation. Thanks to the elaborate design of the slave models, the co-training mechanism among them, and the novel interactions between the master and slave models, our approach significantly outperforms existing state-of-the-art algorithms on both synthetic and real datasets for recommendation tasks. The code is available at \url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
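The decision loop the abstract describes — several slave samplers proposing constraint-feasible top-$K$ subsets and a master ranking them with a UCB-style score before acting — can be sketched compactly. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise-distance diversity constraint, the single rejection-sampling slave (standing in for the six specialised ones), and the tabular UCB score (standing in for the neural contextual UCB network) are all hypothetical.

# Minimal sketch of the master-slave top-K bandit loop described in the
# abstract. All components here are illustrative stand-ins, not the
# authors' actual API.
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 5                         # N arms, choose a top-K subset
features = rng.normal(size=(N, 8))   # per-arm context features

def diversity_ok(subset, min_pairwise_dist=0.5):
    """Hypothetical diversity constraint: all chosen arms must be at least
    min_pairwise_dist apart in feature space."""
    for i in range(len(subset)):
        for j in range(i + 1, len(subset)):
            if np.linalg.norm(features[subset[i]] - features[subset[j]]) < min_pairwise_dist:
                return False
    return True

def slave_random(k):
    """One stand-in slave: rejection-sample subsets until the diversity
    constraint is satisfied (the paper uses six specialised slaves)."""
    while True:
        cand = rng.choice(N, size=k, replace=False)
        if diversity_ok(cand):
            return cand

def score_ucb(subset, mean_est, counts, t, alpha=1.0):
    """UCB-style master score: estimated reward plus an exploration bonus.
    The paper trains a neural contextual UCB network; this tabular stand-in
    only illustrates the exploration/exploitation trade-off."""
    mu = mean_est[subset].sum()
    bonus = alpha * np.sqrt(np.log(t + 1) / (counts[subset] + 1)).sum()
    return mu + bonus

mean_est = np.zeros(N)               # running per-arm reward estimates
counts = np.zeros(N)
slaves = [slave_random] * 3          # stand-in for the six slave models

for t in range(200):
    # Each slave proposes an elite sample; the master keeps the best score.
    proposals = [s(K) for s in slaves]
    action = max(proposals, key=lambda sub: score_ucb(sub, mean_est, counts, t))

    # Non-linear full-bandit feedback: a single scalar reward for the subset.
    reward = np.tanh(features[action].sum())   # toy non-linear reward

    # Credit the observed reward back to the chosen arms (simple heuristic).
    counts[action] += 1
    mean_est[action] += (reward / K - mean_est[action]) / counts[action]

In the method itself, each of the six slaves targets a different trade-off among reward, constraint satisfaction, and sampling cost, and the slaves are further improved by teaching-learning-based optimization and policy co-training before the master scores their elite samples.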
