Unobserved Is Not Equal to Non-existent: Using Gaussian Processes to Infer Immediate Rewards Across Contexts

Learning optimal policies in real-world domains with delayed rewards is a major challenge in Reinforcement Learning. We address the credit assignment problem by proposing a Gaussian Process (GP)-based immediate reward approximation algorithm and evaluate its effectiveness in 4 contexts where rewards can be delayed over long trajectories. In one GridWorld game and 8 Atari games, where immediate rewards are available, the proposed GP-inferred reward policy performed at least as well as the immediate reward policy on 7 out of 9 games and significantly outperformed the corresponding delayed reward policy. In e-learning and healthcare applications, we combined GP-inferred immediate rewards with offline Deep Q-Network (DQN) policy induction and showed that the GP-inferred reward policies outperformed the policies induced using delayed rewards in both real-world contexts.
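To illustrate the general idea of inferring per-step rewards from delayed, end-of-trajectory feedback, the following is a minimal sketch, not the paper's exact procedure: it fits a scikit-learn GaussianProcessRegressor from state-action features to a crude per-step target obtained by spreading each trajectory's delayed return evenly over its steps. The feature construction, kernel choice, and even-split target heuristic are assumptions made purely for illustration.

```python
# Illustrative sketch only: GP regression from state-action features to
# inferred immediate rewards, trained on naive even-split targets.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def infer_immediate_rewards(trajectories):
    """trajectories: list of (features, delayed_return) pairs, where
    `features` is a (T, d) array of state-action features for one episode
    and `delayed_return` is the single reward observed at episode end."""
    X, y = [], []
    for features, delayed_return in trajectories:
        T = len(features)
        X.append(features)
        # Crude training target (assumption): split the delayed return
        # evenly across the episode's steps.
        y.append(np.full(T, delayed_return / T))
    X, y = np.vstack(X), np.concatenate(y)

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X, y)
    return gp  # gp.predict(new_features) yields inferred per-step rewards

# Usage with synthetic data: the inferred rewards r_hat could then replace
# the missing immediate rewards when training an RL agent such as a DQN.
rng = np.random.default_rng(0)
trajs = [(rng.normal(size=(20, 4)), rng.normal()) for _ in range(10)]
gp = infer_immediate_rewards(trajs)
r_hat = gp.predict(rng.normal(size=(5, 4)))
```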
