Can Reinforcement Learning Always Provide the Best Policy?

Reinforcement learning addresses how to find the best policy in an uncertain environment so as to maximize some notion of long-term reward. In sequential decision making, it is often expected that the best policy can be obtained by choosing an appropriate reward or penalty for each action. In this paper, we provide a counterexample showing that the best sequential decision rule cannot be obtained by any choice of reward function within the standard reinforcement learning framework. In fact, the best policy, namely the randomized sequential probability ratio test, can only be learned via a rather unconventional formulation of reinforcement learning. The implications for the design of classifier combination methods are also discussed.
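For context, the classical (non-randomized) sequential probability ratio test that the randomized variant extends can be sketched as follows. This is a minimal illustration using Wald's approximate thresholds, not the paper's construction; the function name and interface are hypothetical.

```python
import math

def sprt(log_lr_steps, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test (sketch).

    log_lr_steps: per-observation log-likelihood ratio increments
                  log[p(x | H1) / p(x | H0)].
    alpha, beta:  target type-I and type-II error probabilities.
    Returns (decision, number_of_samples_used).
    """
    # Wald's approximate decision thresholds.
    a = math.log((1 - beta) / alpha)   # accept H1 above this
    b = math.log(beta / (1 - alpha))   # accept H0 below this
    llr = 0.0
    n = 0
    for step in log_lr_steps:
        n += 1
        llr += step
        if llr >= a:
            return "H1", n
        if llr <= b:
            return "H0", n
    # Stream exhausted before either threshold was crossed.
    return "continue", n
```

With alpha = beta = 0.05 the thresholds are roughly ±log(19) ≈ ±2.944, so a stream of unit-sized increments decides after three observations.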
