On-line Reinforcement Learning for Nonlinear Motion Control: Quadratic and Non-Quadratic Reward Functions

Abstract Reinforcement learning (RL) is an active research area with applications in many fields. RL can be used to learn control strategies for nonlinear dynamic systems without requiring a mathematical model of the system. An essential element in RL is the reward function, which bears a close resemblance to the cost function in optimal control. Analogous to linear quadratic (LQ) control, a quadratic reward function has been applied in RL. However, the literature offers no analysis or motivation for this choice beyond the parallel to LQ control. This paper shows that the use of a quadratic reward function in on-line RL may lead to counter-intuitive results in the form of a large steady-state error. Although the RL controller learns well, the final performance is not acceptable from a control-theoretic point of view. The reasons for this discrepancy are analyzed and the results are compared with non-quadratic reward functions (absolute value and square root) using a model-learning actor-critic with local linear regression. One of the conclusions is that the absolute-value reward function reduces the steady-state error considerably, while the learning time is only slightly longer than with the quadratic reward.
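The abstract contrasts quadratic, absolute-value, and square-root reward shapes. As a rough illustration of why the shape matters near zero error, the sketch below evaluates the three penalty shapes on a scalar tracking error. The scalar form, the weight q, and the function names are illustrative assumptions for this sketch and are not the exact reward definitions or weighting matrices used in the paper.

```python
import numpy as np

# Illustrative per-step reward shapes on a scalar tracking error e.
# Assumption: a single weighted error term; the paper penalizes the
# full state with its own weighting.

def quadratic_reward(e, q=1.0):
    """Quadratic penalty, analogous to an LQ cost.
    Its gradient vanishes near e = 0, so small residual errors
    are barely penalized."""
    return -q * e**2

def absolute_reward(e, q=1.0):
    """Absolute-value penalty: constant gradient magnitude,
    so small errors still incur a noticeable penalty."""
    return -q * np.abs(e)

def sqrt_reward(e, q=1.0):
    """Square-root penalty: gradient magnitude grows as e -> 0,
    penalizing small errors even more strongly."""
    return -q * np.sqrt(np.abs(e))

if __name__ == "__main__":
    # Compare the three shapes on a few error values.
    for e in np.linspace(-1.0, 1.0, 5):
        print(f"e={e:+.2f}  quad={quadratic_reward(e):+.3f}  "
              f"abs={absolute_reward(e):+.3f}  sqrt={sqrt_reward(e):+.3f}")
```

Printing the three penalties side by side shows that for small errors the quadratic penalty is much flatter than the other two, which is consistent with the steady-state error phenomenon the paper analyzes.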