Universal Reinforcement Learning

We consider an agent interacting with an unmodeled environment. At each time step, the agent makes an observation, takes an action, and incurs a cost; its actions can influence future observations and costs. The goal is to minimize the long-term average cost. We propose a novel algorithm, the active LZ algorithm, for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction. We establish that, under the active LZ algorithm, if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum. Experimental results involving the game of Rock-Paper-Scissors illustrate the merits of the algorithm.
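The abstract does not spell out the mechanics of the active LZ algorithm, but the two ingredients it names, Lempel-Ziv-style parsing for context modeling and cost-minimizing action selection, can be illustrated with a short sketch. The Python code below is a hypothetical toy, not the paper's algorithm: it maintains an LZ78 parse tree over (action, observation) symbols, keeps Laplace-smoothed average costs per action at each tree node, and acts epsilon-greedily. All names (LZContextAgent, BEATS) and the fixed exploration rate are illustrative assumptions.

```python
import random
from collections import defaultdict


class LZContextAgent:
    """LZ78-style context agent: cost statistics live on the nodes of an
    incremental parse tree over (action, observation) symbols.
    A toy sketch, not the paper's active LZ algorithm."""

    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon              # exploration rate (assumed fixed)
        self.cost_sum = defaultdict(float)  # (context, action) -> total cost
        self.count = defaultdict(int)       # (context, action) -> visit count
        self.phrases = set()                # phrases seen so far (the tree)
        self.context = ()                   # current node, as a phrase tuple

    def act(self):
        """Epsilon-greedy choice of the action with lowest smoothed cost."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)

        def avg_cost(a):
            key = (self.context, a)
            # Laplace smoothing: unvisited actions default to cost 0,
            # which encourages trying each of them at least once.
            return self.cost_sum[key] / (self.count[key] + 1)

        return min(self.actions, key=avg_cost)

    def update(self, action, observation, cost):
        """Record the cost at the current node, then advance the parse."""
        key = (self.context, action)
        self.cost_sum[key] += cost
        self.count[key] += 1
        phrase = self.context + ((action, observation),)
        if phrase in self.phrases:
            self.context = phrase     # keep descending the existing tree
        else:
            self.phrases.add(phrase)  # novel phrase: grow the tree and
            self.context = ()         # restart at the root (LZ78 parsing)


# Toy usage: Rock-Paper-Scissors against an opponent that cycles R, P, S,
# with cost 0 for a win, 0.5 for a tie, and 1 for a loss.
ACTIONS = ["R", "P", "S"]
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

agent = LZContextAgent(ACTIONS)
total = 0.0
T = 20000
for t in range(T):
    a = agent.act()
    opp = ["R", "P", "S"][t % 3]
    cost = 0.0 if BEATS[a] == opp else (0.5 if a == opp else 1.0)
    agent.update(a, opp, cost)
    total += cost
print("average cost:", total / T)
```

Against this periodic opponent, contexts of depth one already reveal the opponent's next move, so the running average cost should fall well below the 0.5 achieved by uniformly random play. The paper's guarantee, by contrast, covers the general case where the window length K is unknown and requires the actual active LZ machinery rather than this fixed-epsilon heuristic.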
