Reinforcement Learning with Immediate Rewards and Linear Hypotheses

Abstract: We consider the design and analysis of algorithms that learn from the consequences of their actions, with the goal of maximizing cumulative reward, in the setting where the consequence of each action is felt immediately and an a priori unknown linear function (approximately) relates a feature vector for each action/state pair to the (expected) associated reward. We focus on two cases: one in which a continuous-valued reward is (approximately) obtained by applying the unknown linear function to the feature vector, and another in which that linear function (approximately) gives the probability of receiving the larger of two binary-valued rewards. For both cases we give algorithms with per-trial regret bounds that go to zero as the number of trials approaches infinity, and we provide lower bounds showing that the rate of convergence is nearly optimal.
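
To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of the immediate-reward problem with a linear hypothesis: on each trial the learner observes a feature vector per action, chooses one, and receives a reward whose expectation is a fixed but unknown linear function of the chosen feature vector. The epsilon-greedy exploration schedule and the least-squares estimate of the weight vector below are illustrative assumptions, not the method analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, n_trials = 5, 10, 20_000
w_true = rng.normal(size=d)              # unknown linear reward function

A = np.eye(d) * 1e-3                     # running X^T X (small ridge for invertibility)
b = np.zeros(d)                          # running X^T r
total_reward = total_best = 0.0

for t in range(1, n_trials + 1):
    feats = rng.normal(size=(n_actions, d))   # one feature vector per action
    w_hat = np.linalg.solve(A, b)             # least-squares estimate of w_true
    if rng.random() < 1.0 / np.sqrt(t):       # decaying exploration rate
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(feats @ w_hat))     # exploit the current estimate
    reward = feats[a] @ w_true + rng.normal(scale=0.1)    # noisy linear reward
    total_reward += reward
    total_best += float(np.max(feats @ w_true))           # best expected reward this trial

    A += np.outer(feats[a], feats[a])         # update the normal equations
    b += reward * feats[a]

print("average per-trial regret:", (total_best - total_reward) / n_trials)
```

Under these assumptions the printed average per-trial regret shrinks as the number of trials grows, which is the kind of guarantee the abstract refers to.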
