Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both the mean and the variance of the return. Previous work uses the second moment of the return to estimate the variance indirectly. Instead, we use a much simpler, recently proposed direct variance estimator, which updates its estimates incrementally using temporal-difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state-action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories with lower variance in the return.

Introduction

Reinforcement learning (RL) agents conventionally learn to solve a task by optimizing the expected accumulated discounted reward (return). However, in risk-sensitive applications like industrial automation, finance, medicine, or robotics, the standard RL objective may not suffice, because it does not account for the variability induced by the return distribution. In this paper, we propose a technique that promotes learning of policies with less variability.

Variability in sequential decision-making problems can arise from two sources: the inherent stochasticity of the environment (transitions and rewards), and imperfect knowledge of the model. The former source of variability is addressed by risk-sensitive Markov decision processes (MDPs) (Howard and Matheson 1972; Heger 1994; Borkar 2001, 2002), whereas the latter is covered by robust MDPs (Iyengar 2005; Nilim and El Ghaoui 2005). In this work, we address the former source of variability in an RL setup via mean-variance optimization. One could account for mean-variance tradeoffs via maximization of the mean subject to variance constraints (solved using constrained MDPs (Altman 1999)), maximization of the Sharpe ratio (Sharpe 1994), or incorporation of the variance as a penalty in the objective function (Filar, Kallenberg, and Lee 1989; White 1994). Here, we use a variance-penalized method and solve the optimization problem by adding a penalty term to the objective.

There are two ways to compute the variance of the return, Var(G). The indirect approach estimates Var(G) using Bellman equations for both the first moment (i.e., the value function) and the second moment, as Var(G) = E[G^2] − (E[G])^2 (Sobel 1982). The direct approach forms a Bellman equation for the variance itself, as Var(G) = E[(G − E[G])^2] (Sherstan et al. 2018), skipping the calculation of the second moment. Sherstan et al. (2018) empirically established that, in the policy evaluation setting, the direct variance estimation approach is better behaved than the indirect approach in several scenarios: (a) when the value estimates are noisy, (b) when eligibility traces are used in the value estimation, and (c) when the variance of the return is estimated from off-policy samples.
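To make the distinction between the two estimators concrete, the following is a minimal tabular TD(0) sketch of both for the policy evaluation setting. It is an illustrative sketch, not the paper's implementation: the array names V (value), M (direct variance estimate), U (second moment), the step sizes alpha and beta, and the helper td_variance_step are all assumptions introduced here.

```python
import numpy as np

def td_variance_step(V, M, U, s, r, s_next, gamma, alpha, beta):
    """One sampled TD(0) update of the value V, the direct variance estimate M,
    and the second-moment estimate U used by the indirect method.
    All estimators are tabular arrays indexed by (integer) state."""
    # Standard TD(0) update for the value function (first moment).
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta

    # Direct estimator (Sherstan et al. 2018 style): a Bellman equation on the
    # variance itself, with "reward" delta^2 and discount gamma^2. This target
    # is exact only when the value estimates V are accurate.
    M[s] += beta * (delta**2 + gamma**2 * M[s_next] - M[s])

    # Indirect estimator (Sobel 1982 style): TD update on the second moment;
    # the variance is then recovered as U[s] - V[s]**2.
    u_target = r**2 + 2 * gamma * r * V[s_next] + gamma**2 * U[s_next]
    U[s] += beta * (u_target - U[s])
    return V, M, U

# Example: a three-state chain and a single sampled transition (illustrative only).
V, M, U = np.zeros(3), np.zeros(3), np.zeros(3)
td_variance_step(V, M, U, s=0, r=1.0, s_next=1, gamma=0.99, alpha=0.1, beta=0.1)
```

The direct variance estimate is read off as M[s], whereas the indirect approach must combine two estimators, U[s] − V[s]^2, which is one reason the direct approach is simpler to work with.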
Due to the above benefits and the simplicity of the direct approach, we build upon the method of Sherstan et al. (2018), which was proposed only for the policy evaluation setting, and develop actor-critic algorithms for both the on-policy and off-policy (control) settings.

Contributions: (1) We modify the standard policy gradient objective to include a direct variance estimator for learning policies that maximize the variance-penalized return. (2) We develop a multi-timescale actor-critic algorithm by deriving the gradient of the variance estimator in both the on-policy and the off-policy case. (3) We prove convergence to locally optimal policies in the on-policy tabular setting. (4) We compare our proposed variance-penalized actor-critic (VPAC) algorithm with two baselines: actor-critic (AC) (Sutton et al. 2000; Konda and Tsitsiklis 2000), and an existing indirect variance-penalized approach called variance-adjusted actor-critic (VAAC) (Tamar and Mannor 2013). We evaluate our on- and off-policy VPAC algorithms in both discrete and continuous domains. The empirical findings demonstrate that VPAC compares favorably to both baselines in terms of the mean return, but generates trajectories with significantly lower variance in the return.

Preliminaries

Notation

We consider an infinite-horizon discrete MDP 〈S, A, R, P, γ〉 with finite state space S and finite action space A. R denotes the reward function (with R_{t+1} denoting the reward at time t). A stochastic policy π(·|s) governs the behavior of the agent: in state s, the agent chooses an action a ∼ π(·|s) and then transitions to the next state s′ according to the transition probability P(s′|s, a). γ ∈ [0, 1] is the discount factor. Let G_t = ∑_{l=0}^∞ γ^l R_{t+1+l} denote the accumulated discounted reward (also known as the return) along a trajectory. The state value function for π is defined as V_π(s) = E_π[G_t | S_t = s], and the state-action value function as Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]. In this paper, E_π[·] denotes the expectation over the transition function of the MDP and the action distribution under policy π.

Actor-Critic (AC)

The policy gradient (PG) method (Sutton et al. 2000) is a policy optimization algorithm that performs gradient ascent in the direction maximizing the expected return. Given a parameterized policy π_θ(a|s), where θ is the policy parameter, an initial state distribution d_0, and the discounted weighting of states d_π(s) = ∑_{t=0}^∞ γ^t P(S_t = s | s_0 ∼ d_0, π) encountered starting from some state s_0, the gradient of the objective function J_{d_0}(θ) = ∑_{s_0} d_0(s_0) V_{π_θ}(s_0) (Sutton and Barto 2018) is given by:

∇_θ J_{d_0}(θ) = ∑_s d_π(s) ∑_a ∇_θ π_θ(a|s) Q_{π_θ}(s, a).
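As a point of reference for the AC baseline that VPAC is compared against, here is a minimal tabular sketch of a one-step actor-critic update with a softmax policy, in which the TD error serves as the advantage estimate. This is a sketch of the standard baseline only (it does not include the paper's variance penalty), and the names ac_step, alpha_v, alpha_pi, and gamma_t are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(prefs):
    # Numerically stable softmax over action preferences.
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def ac_step(theta, V, s, a, r, s_next, gamma, alpha_v, alpha_pi, gamma_t):
    """One-step actor-critic update for a tabular softmax policy.
    theta: |S| x |A| array of policy preferences; V: |S| critic table.
    gamma_t is the accumulated discount gamma**t, matching the discounted
    state weighting d_pi in the gradient above."""
    # Critic: the TD(0) error doubles as an estimate of the advantage Q(s,a) - V(s).
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta

    # Actor: gradient of log pi_theta(a|s) for a softmax parameterization
    # is the one-hot action indicator minus the action probabilities.
    pi = softmax(theta[s])
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] += alpha_pi * gamma_t * delta * grad_log
    return theta, V
```

Sampling s ∼ d_π and a ∼ π_θ(·|s) and following this update performs stochastic gradient ascent on J_{d_0}(θ); the paper's variance-penalized variant modifies the objective that this ascent direction comes from.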
References

[1] V. Borkar. Stochastic approximation with two time scales. 1997.
[2] Javier García et al. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 2015.
[3] Vivek S. Borkar et al. Q-Learning for Risk-Sensitive Control. Math. Oper. Res., 2002.
[4] W. Sharpe. The Sharpe Ratio. 1994.
[5] Sergey Levine et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR, 2015.
[6] John N. Tsitsiklis et al. Neuro-Dynamic Programming. 1996.
[7] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. 2008.
[8] Alec Radford et al. Proximal Policy Optimization Algorithms. arXiv, 2017.
[9] Marcello Restelli et al. Risk-Averse Trust Region Optimization for Reward-Volatility Reduction. IJCAI, 2019.
[10] Marc G. Bellemare et al. A Distributional Perspective on Reinforcement Learning. ICML, 2017.
[11] Vivek S. Borkar et al. A sensitivity formula for risk-sensitive cost and the actor-critic algorithm. Syst. Control. Lett., 2001.
[12] Mohammad Ghavamzadeh et al. Algorithms for CVaR Optimization in MDPs. NIPS, 2014.
[13] Richard S. Sutton et al. Reinforcement Learning: An Introduction. 1998.
[14] Vivek S. Borkar et al. A Learning Algorithm for Risk-Sensitive Cost. Math. Oper. Res., 2008.
[15] Jerzy A. Filar et al. Variance-Penalized Markov Decision Processes. Math. Oper. Res., 1989.
[16] E. Altman. Constrained Markov Decision Processes. 1999.
[17] D. Duffie et al. An Overview of Value at Risk. 1997.
[18] Shie Mannor et al. Variance Adjusted Actor Critic Algorithms. arXiv, 2013.