Regression Oracles and Exploration Strategies for Short-Horizon Multi-Armed Bandits

This paper explores multi-armed bandit (MAB) strategies in very short-horizon scenarios, i.e., when the bandit strategy is allowed only a small number of interactions with the environment. This setting is understudied in the MAB literature, yet has many applications in the context of games, such as player modeling. Specifically, we pursue three different ideas. First, we explore the use of regression oracles, which replace the simple average used in strategies such as ϵ-greedy with linear regression models. Second, we examine different exploration patterns, such as forced exploration phases. Finally, we introduce a new variant of the UCB1 strategy, called UCBT, that has interesting properties and no tunable parameters. We present experimental results in a domain motivated by exergames, where the goal is to maximize a player's daily steps. Our results show that combining ϵ-greedy or ϵ-decreasing with regression oracles outperforms all other tested strategies in the short-horizon setting.
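To make the regression-oracle idea concrete, below is a minimal Python sketch of ϵ-greedy in which each arm's value estimate comes from a per-arm linear regression of reward on time step rather than a plain sample mean. The environment callback `pull`, the initial round-robin phase, and the use of `np.polyfit` as the fitting routine are illustrative assumptions, not details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_regression(pull, n_arms, horizon, epsilon=0.1):
    """epsilon-greedy where each arm's value estimate is a per-arm
    linear regression of reward on time step, not a sample mean.
    `pull(arm, t)` is an assumed environment callback returning a reward."""
    history = {a: ([], []) for a in range(n_arms)}  # arm -> (time steps, rewards)
    total = 0.0
    for t in range(horizon):
        if t < n_arms:                      # assumed round-robin initialization
            arm = t
        elif rng.random() < epsilon:        # explore uniformly at random
            arm = int(rng.integers(n_arms))
        else:                               # exploit the regression oracle
            preds = []
            for a in range(n_arms):
                ts, rs = history[a]
                if len(ts) < 2:             # too little data to fit a line
                    preds.append(float(np.mean(rs)))
                else:
                    slope, intercept = np.polyfit(ts, rs, 1)
                    preds.append(slope * t + intercept)  # extrapolate to now
            arm = int(np.argmax(preds))
        reward = pull(arm, t)
        history[arm][0].append(t)
        history[arm][1].append(reward)
        total += reward
    return total

# Toy usage: 3 arms whose expected rewards drift linearly over time,
# where a regression estimate can track the trend that a sample mean lags.
total = epsilon_greedy_regression(
    lambda a, t: [0.10, 0.05, 0.20][a] * t + rng.normal(scale=0.5),
    n_arms=3, horizon=20)

For contrast, standard UCB1 (Auer et al., 2002) pulls the arm maximizing x̄_i + sqrt(2 ln n / n_i), where x̄_i is arm i's empirical mean reward, n_i its pull count, and n the total number of pulls so far; the abstract describes UCBT only as a parameter-free variant of UCB1, so no selection rule is sketched for it here.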
