Quantile Reinforcement Learning

In reinforcement learning, the standard criterion for evaluating a policy in a state is the expected (discounted) sum of rewards. However, this criterion is not always suitable, so we consider an alternative criterion based on the notion of quantiles. For episodic reinforcement learning problems, we propose an algorithm based on two-timescale stochastic approximation. We evaluate our proposal on a simple model of the TV show "Who Wants to Be a Millionaire?".
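
To make the two-timescale idea concrete, the following is a minimal, hypothetical sketch in Python: a fast Robbins-Monro iterate tracks the tau-quantile of the episodic return under the current policy, while a slower iterate nudges a policy parameter using that quantile estimate as a baseline. The toy task, payoffs, step-size schedules, and policy-update rule are illustrative assumptions, not the algorithm evaluated in the paper.

```python
import numpy as np

# Minimal sketch of two-timescale stochastic approximation for a quantile
# criterion (illustration only; not the paper's exact algorithm). Assumed
# toy problem: a single choice between a "risky" and a "safe" action,
# with made-up payoffs.

rng = np.random.default_rng(0)
tau = 0.5      # target quantile level (median of the return distribution)
theta = 0.0    # policy logit, updated on the slow timescale
q = 0.0        # quantile estimate, updated on the fast timescale

for t in range(1, 50_001):
    alpha = t ** -0.6   # fast step size (quantile tracking)
    beta = t ** -1.0    # slow step size (policy improvement)

    # Sample one episodic return under the current stochastic policy.
    p_risky = 1.0 / (1.0 + np.exp(-theta))
    a_risky = rng.random() < p_risky
    g = (10.0 if rng.random() < 0.3 else 0.0) if a_risky else 2.0

    # Fast timescale: Robbins-Monro step driving q toward the tau-quantile
    # of the return distribution induced by the current policy.
    q += alpha * (tau - (1.0 if g < q else 0.0))

    # Slow timescale: crude score-function ascent that reinforces actions
    # whose return exceeds the current quantile estimate (heuristic signal).
    grad_log_pi = (1.0 - p_risky) if a_risky else -p_risky
    theta += beta * (1.0 if g > q else -1.0) * grad_log_pi

print(f"estimated {tau}-quantile: {q:.2f}, "
      f"P(risky action): {1.0 / (1.0 + np.exp(-theta)):.2f}")
```

Because the quantile estimate moves on a much faster schedule than the policy parameter, it effectively equilibrates to the quantile of the current policy before the policy changes appreciably, which is the standard separation-of-timescales argument behind such schemes.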
