Partially observable Markov decision problems (POMDPs) recently received a lot of attention in the reinforcement learning community. No attention, however, has been paid to Levin's universal search through program space (LS), which is theoretically optimal for a wide variety of search problems, including many POMDPs. Experiments in this paper first show that LS can solve partially observable mazes (POMs) involving many more states and obstacles than those solved by various previous authors (here, LS can also easily outperform Q-learning). We then note, however, that LS is not necessarily optimal for "incremental" learning problems, where experience with previous problems may help to reduce future search costs. For this reason, we introduce an adaptive extension of LS (ALS), which uses experience to increase the probabilities of instructions occurring in successful programs found by LS. To deal with cases where ALS does not lead to long-term performance improvement, we use the recent technique of "environment-independent reinforcement acceleration" (EIRA) as a safety belt (EIRA currently is the only known method that guarantees a lifelong history of reward accelerations). Experiments with additional POMs demonstrate: (a) ALS can dramatically reduce the search time consumed by successive calls of LS. (b) Additional significant speed-ups can be obtained by combining ALS and EIRA.

INTRODUCTION

Levin Search (LS). Unbeknownst to many machine learning researchers, there exists a search algorithm with amazing theoretical properties: for a broad class of search problems, LS (Levin; Levin) has the optimal order of computational complexity. For instance, suppose there is an algorithm that solves a certain type of maze task in O(n) steps, where n is a positive integer representing the problem size. Then universal LS will solve the same task in at most O(n) steps. See Li and Vitányi for an overview. See Schmidhuber (b) for recent implementations and applications.

Search through program space is relevant for POMDPs. LS is a smart way of performing exhaustive search by "optimally" allocating time to programs computing solution candidates (details in the section on LS). Since programs written in a general language can use memory to disambiguate environmental inputs, LS is of potential interest for solving partially observable Markov decision problems (POMDPs), which received a lot of attention during recent years, e.g., Jaakkola et al.; Kaelbling et al.; Ring; McCallum.

Incremental extensions of LS. LS by itself, however, is non-incremental: it does not use experience with previous tasks to speed up performance on new tasks. Therefore, it cannot immediately be used in typical incremental reinforcement learning scenarios where, in case of success, the system is given "reinforcement" (a real number) and tries to use that experience to maximize the sum of future reinforcements to be obtained during the remainder of system life. There have been proposals of adaptive variants of LS that modify LS's underlying probability distribution on program space (Solomonoff; Schmidhuber, b). None of these, however, can guarantee that the lifelong history of probability modifications will correspond to a lifelong history of reinforcement accelerations.

EIRA. The problem above has been addressed recently (Schmidhuber). At certain times in system life called checkpoints, a novel technique called "environment-independent reinforcement acceleration" (EIRA) invalidates certain modifications of the system's policy (the policy can be an arbitrary modifiable algorithm mapping environmental inputs and internal states to outputs and new internal states), such that all currently valid modifications are justified in the following sense: each still-valid modification has been followed by long-term performance speed-up. To measure speed, at each checkpoint EIRA looks at the entire time interval that went by since the modification occurred. To do this efficiently, EIRA performs some backtracking (the time required for backtracking is taken into account for measuring performance speed-ups). EIRA is general in the sense that it can be combined with your favorite learning or search algorithm. Essentially, EIRA works as a safety belt where your favorite learning algorithm fails to improve things such that long-term reinforcement intake speeds up (see details in the section on EIRA).

Outline of paper. The next section describes LS details. The section after that presents the heuristic adaptation method (ALS), a simple, adaptive, incremental extension of LS related to the linear reward-inaction algorithm, e.g., Kaelbling. It is followed by a section that briefly reviews EIRA and shows how to combine it with ALS. The results section then presents an illustrative application involving a maze that has many more states and obstacles than mazes solved by previous authors working on POMDPs: we show how LS can solve partially observable maze tasks with huge state spaces and non-trivial but low-complexity solutions (Q-learning fails to solve such tasks). Then we show that ALS can use previous experience to significantly reduce search time. Finally, we show that ALS augmented by EIRA can clearly outperform ALS by itself. The final section presents conclusions.
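The two ideas from the introduction that lend themselves to a short illustration are LS's allocation of time to programs and ALS's shifting of instruction probabilities toward successful programs. The Python sketch below is not the authors' implementation; it is a minimal toy under several stated assumptions. Programs are plain sequences of four move instructions with no memory or loops (so the language is far from general), instruction probabilities do not depend on the position in the program, the phase budget doubles as `2 ** phase`, and every name and parameter (`levin_search`, `als_update`, `gamma`, `max_len`) is hypothetical. The probability update is a linear reward-inaction style rule chosen for illustration.

```python
# Toy sketch only: a simplified Levin-style search plus an ALS-like update.
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

def run(program, maze, start, goal, max_steps):
    """Execute at most max_steps instructions; return (solved, steps_used)."""
    x, y = start
    steps = 0
    for instr in program[:max_steps]:
        steps += 1
        dx, dy = MOVES[instr]
        nx, ny = x + dx, y + dy
        inside = 0 <= ny < len(maze) and 0 <= nx < len(maze[0])
        if inside and maze[ny][nx] != "#":   # bumping into a wall is a no-op
            x, y = nx, ny
        if (x, y) == goal:
            return True, steps
    return False, steps

def candidates(probs, budget, max_len, prefix=(), p=1.0):
    """Depth-first enumeration of programs whose probability earns >= 1 step."""
    if len(prefix) == max_len:               # assumed cap to keep the toy small
        return
    for instr, q in probs.items():
        pq = p * q
        if pq * budget >= 1.0:
            program = prefix + (instr,)
            yield program, pq
            yield from candidates(probs, budget, max_len, program, pq)

def levin_search(maze, start, goal, probs, max_phase=16, max_len=8):
    """In each phase, program q may run for at most P(q) * budget steps."""
    spent = 0
    for phase in range(1, max_phase + 1):
        budget = 2 ** phase                  # budget doubles from phase to phase
        for program, p in candidates(probs, budget, max_len):
            solved, used = run(program, maze, start, goal, int(p * budget))
            spent += used
            if solved:
                return program, phase, spent
    return None, max_phase, spent

def als_update(probs, program, gamma=0.1):
    """Raise the probability of each instruction used by a successful program."""
    new = dict(probs)
    for instr in program:
        for name in new:
            if name == instr:
                new[name] += gamma * (1.0 - new[name])
            else:
                new[name] *= 1.0 - gamma     # keeps the probabilities summing to 1
    return new

maze = ["...",
        ".#.",
        "..."]
uniform = {m: 0.25 for m in MOVES}
program, phase, steps = levin_search(maze, (0, 0), (2, 2), uniform)
adapted = als_update(uniform, program)
program2, phase2, steps2 = levin_search(maze, (0, 0), (2, 2), adapted)
print(program, phase, steps)
print(program2, phase2, steps2)
```

On this 3x3 maze the first call starts from uniform probabilities, and the second call, using the adapted probabilities, finds a solution in an earlier phase. The update is purely heuristic, so nothing in this sketch guarantees long-term improvement; that gap is what EIRA is meant to cover.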
[1] Leonid A. Levin, et al. Randomness Conservation Inequalities; Information and Independence in Mathematical Theories. Inf. Control, 1984.
[2] Ray J. Solomonoff, et al. The Application of Algorithmic Probability to Problems in Artificial Intelligence. UAI, 1985.
[3] Osamu Watanabe, et al. Kolmogorov Complexity and Computational Complexity. EATCS Monographs on Theoretical Computer Science, 2012.
[4] Leslie Pack Kaelbling, et al. Learning in embedded systems. 1993.
[5] Andrew McCallum, et al. Overcoming Incomplete Perception with Utile Distinction Memory. ICML, 1993.
[6] Ming Li, et al. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science, 2019.
[7] Michael L. Littman, et al. Memoryless policies: theoretical limitations and practical results. 1994.
[8] Michael I. Jordan, et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. NIPS, 1994.
[9] Dave Cliff, et al. Adding Temporary Memory to ZCS. Adapt. Behav., 1994.
[10] Mark B. Ring. Continual learning in reinforcement environments. GMD-Bericht, 1995.
[11] Juergen Schmidhuber. Discovering Solutions with Low Kolmogorov Complexity and High Generalization Capability. 1995.
[12] Juergen Schmidhuber, et al. Environment-independent Reinforcement Acceleration. 1995.
[13] Andrew McCallum, et al. Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State. ICML, 1995.
[14] Juergen Schmidhuber, et al. A General Method For Incremental Self-Improvement And Multi-Agent Learning In Unrestricted Environme. 1999.
[15] Leslie Pack Kaelbling, et al. Planning and Acting in Partially Observable Stochastic Domains. Artif. Intell., 1998.