The Cooperative Multi-agent Learning with Random Reward Values

This paper investigates how to learn optimal action policies in cooperative multi-agent systems when the agents' rewards are random variables, and proposes a general two-stage learning algorithm for cooperative multi-agent decision processes. The algorithm first computes the average immediate rewards, then treats these learned averages as the agents' immediate action rewards when learning the optimal action policies. The algorithm is proved to find the optimal policies in stochastic environments. An extension of the algorithm to stochastic Markov decision processes is also discussed.
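
The two-stage idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes a stateless two-agent cooperative game with a shared reward, Gaussian reward noise, and a simple greedy choice in the second stage. The payoff table `TRUE_MEANS` and the names `estimate_mean_rewards`, `greedy_joint_action`, `noisy_team_reward`, and `run_two_stage` are all hypothetical and chosen only for illustration.

```python
import itertools
import random


def estimate_mean_rewards(sample_reward, joint_actions, n_samples, rng):
    """Stage 1: average repeated noisy reward samples for each joint action."""
    means = {}
    for joint_action in joint_actions:
        total = 0.0
        for _ in range(n_samples):
            total += sample_reward(joint_action, rng)
        means[joint_action] = total / n_samples
    return means


def greedy_joint_action(mean_rewards):
    """Stage 2: treat the averages as fixed rewards and pick the best joint action."""
    return max(mean_rewards, key=mean_rewards.get)


# Hypothetical expected team rewards for a two-agent, two-action game (assumption).
TRUE_MEANS = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}


def noisy_team_reward(joint_action, rng):
    """Shared random reward: true mean plus unit-variance Gaussian noise (assumption)."""
    return TRUE_MEANS[joint_action] + rng.gauss(0.0, 1.0)


def run_two_stage(n_samples=2000, seed=0):
    """Run both stages and return the learned averages and the chosen joint action."""
    rng = random.Random(seed)
    joint_actions = list(itertools.product([0, 1], repeat=2))
    means = estimate_mean_rewards(noisy_team_reward, joint_actions, n_samples, rng)
    return means, greedy_joint_action(means)


if __name__ == "__main__":
    learned_means, best = run_two_stage()
    print("learned averages:", learned_means)
    print("selected joint action:", best)
```

In this sketch the second stage reduces to an argmax because the game has no states. In a multi-state decision process, the averaged rewards would instead feed a planning or learning method of the reader's choice, such as value iteration.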