Advantage Amplification in Slowly Evolving Latent-State Environments

Latent-state environments with long horizons, such as those faced by recommender systems, pose significant challenges for reinforcement learning (RL). In this work, we identify and analyze several key hurdles for RL in such environments, including belief-state error and small action advantages. We develop a general principle of advantage amplification that can overcome these hurdles through the use of temporal abstraction. We propose several aggregation methods and prove that they induce amplification in certain settings. We also bound the loss in optimality incurred by our methods in environments where the latent state evolves slowly, and demonstrate their performance empirically on a stylized user-modeling task.
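To build intuition for the amplification principle, here is a minimal, self-contained sketch (not the paper's construction, and all names here, such as simulate_choice, delta, eps, and k, are illustrative assumptions): if an action's per-step advantage is a small delta but each noisy evaluation carries error of magnitude eps >> delta, then holding (repeating) the action for k steps accumulates roughly k * delta of advantage against a single eps-sized error, improving the signal-to-noise ratio of the decision.

```python
import random

def simulate_choice(delta, eps, k, trials=10000):
    """Fraction of trials in which the better action is correctly
    preferred when its k-step aggregated advantage is compared to the
    alternative under additive Gaussian evaluation noise."""
    correct = 0
    for _ in range(trials):
        # Aggregated advantage of the better action over k held steps,
        # observed through one noisy evaluation of each option.
        good = k * delta + random.gauss(0, eps)
        bad = 0.0 + random.gauss(0, eps)
        if good > bad:
            correct += 1
    return correct / trials

if __name__ == "__main__":
    delta, eps = 0.01, 0.5  # small per-step advantage, large noise
    for k in (1, 10, 100):
        print(f"k={k:>3}: better action preferred "
              f"{simulate_choice(delta, eps, k):.2%} of the time")
```

With these (assumed) values, k = 1 makes the two actions nearly indistinguishable, while larger hold lengths make the better action win almost always, which is the qualitative effect temporal abstraction is meant to produce when the latent state evolves slowly enough that holding an action incurs little optimality loss.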
