A User Comfort Model and Index Policy for Personalizing Discrete Controller Decisions

User feedback makes it possible to tailor system operation to the satisfaction of individual users. A major challenge in personalized decision-making is the systematic construction of a user model during operation while maintaining control performance. This paper presents two contributions: an index-based control policy that collects and processes user feedback, and a user comfort model in the form of a Markov decision process with a priori unknown, user-specific state transition probabilities. The control policy uses explicit user feedback to optimize a reward measure reflecting user comfort and addresses the exploration-exploitation trade-off in a multi-armed bandit framework. The proposed approach combines restless bandits and upper confidence bound algorithms. It introduces an exploration term into the restless bandit formulation, uses user feedback to identify the user model, and is shown to be indexable. We demonstrate its capabilities in a simulation of learning a user's trade-off between comfort and energy usage.
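To make the idea of an exploration term concrete, the following is a minimal sketch of a standard UCB1-style index applied to a small set of discrete controller settings. It is not the paper's restless bandit policy or its user comfort model: the function names (`ucb_index`, `run`), the constant `c`, the Bernoulli "comfortable / not comfortable" feedback, and the comfort probabilities are all assumptions chosen for illustration.

```python
import math
import random


def ucb_index(mean_reward, pulls, total_pulls, c=2.0):
    """UCB1-style index: empirical mean plus an exploration bonus.

    The bonus shrinks as a setting is tried more often, so rarely tried
    settings keep getting revisited. `c` is an assumed tuning constant.
    """
    if pulls == 0:
        return math.inf  # untried settings are always selected first
    return mean_reward + math.sqrt(c * math.log(total_pulls) / pulls)


def run(true_comfort, steps=2000, seed=0):
    """Simulate index-based selection among discrete controller settings.

    true_comfort[a] is a hypothetical probability that the user reports
    being comfortable under setting a (unknown to the controller).
    """
    rng = random.Random(seed)
    n = len(true_comfort)
    pulls = [0] * n      # how often each setting was applied
    means = [0.0] * n    # running mean of the feedback per setting
    for t in range(1, steps + 1):
        indices = [ucb_index(means[a], pulls[a], t) for a in range(n)]
        a = max(range(n), key=indices.__getitem__)  # apply highest index
        feedback = 1.0 if rng.random() < true_comfort[a] else 0.0
        pulls[a] += 1
        means[a] += (feedback - means[a]) / pulls[a]  # incremental mean
    return pulls, means


if __name__ == "__main__":
    pulls, means = run([0.2, 0.5, 0.8])
    print("pulls per setting:", pulls)
    print("estimated comfort:", [round(m, 2) for m in means])
```

In this sketch each setting is a static arm with a fixed comfort probability. The approach summarized above differs in that the arms are restless (their states evolve even when not selected) and the user model is a Markov decision process whose transition probabilities are learned from feedback.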
