LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES

This study is concerned with finite Markov decision processes whose dynamics and reward structure are unknown but the state is observable exactly. We establish a learning algorithm which yields an optimal policy and construct an adaptive policy which is optimal under the average expected reward criterion.