Distributed Policy Evaluation Under Multiple Behavior Strategies

We apply diffusion strategies to develop a fully distributed cooperative reinforcement learning algorithm in which the agents of a network communicate only with their immediate neighbors to improve predictions about their environment. The algorithm also supports off-policy learning, meaning that the agents can predict the value of a target policy that differs from the behavior policies they actually follow. The proposed distributed strategy is efficient, with linear complexity in both computation time and memory footprint. We provide a mean-square-error performance analysis and establish convergence under constant step-size updates, which endow the network with continuous learning capabilities. The results show a clear gain from cooperation: when the individual agents can estimate the solution, cooperation increases stability and reduces the bias and variance of the prediction error; more importantly, the network is able to approach the optimal solution even when none of the individual agents can (e.g., when the individual behavior policies restrict each agent to sampling a small portion of the state space).
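
To make the mechanism concrete, the following is a minimal sketch of one agent-level iteration under these assumptions: linear value-function approximation, importance-weighted off-policy TD updates, and an adapt-then-combine diffusion step over a left-stochastic combination matrix. The class and function names (Agent, adapt, combine) and the particular TDC-style gradient correction are illustrative choices, not the paper's exact recursion.

```python
import numpy as np

class Agent:
    """One network agent with linear value-function approximation."""
    def __init__(self, dim, step_size=0.01, aux_step=0.01):
        self.w = np.zeros(dim)      # primary weights (value estimate)
        self.theta = np.zeros(dim)  # auxiliary weights (gradient correction)
        self.mu = step_size         # constant step size: continuous learning
        self.eta = aux_step

    def adapt(self, phi, phi_next, reward, rho, gamma):
        """Local off-policy TD update (TDC/GTD-style, importance-weighted).

        rho = pi(a|s) / b_k(a|s) corrects for the mismatch between the
        common target policy pi and this agent's behavior policy b_k.
        """
        delta = reward + gamma * (phi_next @ self.w) - phi @ self.w  # TD error
        self.theta += self.eta * rho * (delta - phi @ self.theta) * phi
        self.w += self.mu * rho * (delta * phi
                                   - gamma * (phi @ self.theta) * phi_next)
        return self.w

def combine(agents, A):
    """Diffusion combination step: each agent replaces its intermediate
    estimate by a convex average of its neighbors' estimates, where
    A[l, k] is the weight agent k assigns to neighbor l (columns sum to 1)."""
    W = np.stack([a.w for a in agents], axis=1)  # dim x N matrix of estimates
    W_new = W @ A                                # neighborhood averaging
    for k, a in enumerate(agents):
        a.w = W_new[:, k]
```

In an adapt-then-combine sweep, every agent first calls adapt on its own locally sampled transition and then combine mixes the intermediate estimates over the network. Because only neighbors exchange vectors of the feature dimension, each iteration remains linear in computation and memory, and the combination step is what allows the network estimate to cover regions of the state space that no single behavior policy visits.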
