Universal Reinforcement Learning

We consider an agent interacting with an unmodeled environment. At each time step, the agent makes an observation, takes an action, and incurs a cost; its actions can influence future observations and costs. The goal is to minimize the long-term average cost. We propose a novel algorithm, the active LZ algorithm, for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction. We establish that, under the active LZ algorithm, if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum. Experimental results involving the game of Rock-Paper-Scissors illustrate the merits of the algorithm.
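The abstract does not spell out the mechanics of the active LZ algorithm, but the two ingredients it names, Lempel-Ziv-style parsing for context modeling and cost-minimizing action selection, can be illustrated with a short sketch. The Python code below is a hypothetical toy, not the paper's algorithm: it maintains an LZ78 parse tree over (action, observation) symbols, keeps Laplace-smoothed average costs per action at each tree node, and acts epsilon-greedily. All names (LZContextAgent, BEATS) and the fixed exploration rate are illustrative assumptions.

```python
import random
from collections import defaultdict


class LZContextAgent:
    """LZ78-style context agent: cost statistics live on the nodes of an
    incremental parse tree over (action, observation) symbols.
    A toy sketch, not the paper's active LZ algorithm."""

    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon              # exploration rate (assumed fixed)
        self.cost_sum = defaultdict(float)  # (context, action) -> total cost
        self.count = defaultdict(int)       # (context, action) -> visit count
        self.phrases = set()                # phrases seen so far (the tree)
        self.context = ()                   # current node, as a phrase tuple

    def act(self):
        """Epsilon-greedy choice of the action with lowest smoothed cost."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)

        def avg_cost(a):
            key = (self.context, a)
            # Laplace smoothing: unvisited actions default to cost 0,
            # which encourages trying each of them at least once.
            return self.cost_sum[key] / (self.count[key] + 1)

        return min(self.actions, key=avg_cost)

    def update(self, action, observation, cost):
        """Record the cost at the current node, then advance the parse."""
        key = (self.context, action)
        self.cost_sum[key] += cost
        self.count[key] += 1
        phrase = self.context + ((action, observation),)
        if phrase in self.phrases:
            self.context = phrase     # keep descending the existing tree
        else:
            self.phrases.add(phrase)  # novel phrase: grow the tree and
            self.context = ()         # restart at the root (LZ78 parsing)


# Toy usage: Rock-Paper-Scissors against an opponent that cycles R, P, S,
# with cost 0 for a win, 0.5 for a tie, and 1 for a loss.
ACTIONS = ["R", "P", "S"]
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

agent = LZContextAgent(ACTIONS)
total = 0.0
T = 20000
for t in range(T):
    a = agent.act()
    opp = ["R", "P", "S"][t % 3]
    cost = 0.0 if BEATS[a] == opp else (0.5 if a == opp else 1.0)
    agent.update(a, opp, cost)
    total += cost
print("average cost:", total / T)
```

Against this periodic opponent, contexts of depth one already reveal the opponent's next move, so the running average cost should fall well below the 0.5 achieved by uniformly random play. The paper's guarantee, by contrast, covers the general case where the window length K is unknown and requires the actual active LZ machinery rather than this fixed-epsilon heuristic.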
