Distributed Policy Evaluation Under Multiple Behavior Strategies

We apply diffusion strategies to develop a fully distributed cooperative reinforcement learning algorithm in which the agents of a network communicate only with their immediate neighbors to improve predictions about their environment. The algorithm also supports off-policy learning, meaning that the agents can predict the value of a target policy that differs from the behavior policies they actually follow. The proposed distributed strategy is efficient, with linear complexity in both computation time and memory footprint. We provide a mean-square-error performance analysis and establish convergence under constant step-size updates, which endow the network with continuous learning capabilities. The results show a clear gain from cooperation: when the individual agents can estimate the solution, cooperation increases stability and reduces the bias and variance of the prediction error; more importantly, the network is able to approach the optimal solution even when none of the individual agents can (e.g., when the individual behavior policies restrict each agent to sampling a small portion of the state space).
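
To make the mechanism concrete, the following is a minimal sketch of one agent-level iteration under these assumptions: linear value-function approximation, importance-weighted off-policy TD updates, and an adapt-then-combine diffusion step over a left-stochastic combination matrix. The class and function names (Agent, adapt, combine) and the particular TDC-style gradient correction are illustrative choices, not the paper's exact recursion.

```python
import numpy as np

class Agent:
    """One network agent with linear value-function approximation."""
    def __init__(self, dim, step_size=0.01, aux_step=0.01):
        self.w = np.zeros(dim)      # primary weights (value estimate)
        self.theta = np.zeros(dim)  # auxiliary weights (gradient correction)
        self.mu = step_size         # constant step size: continuous learning
        self.eta = aux_step

    def adapt(self, phi, phi_next, reward, rho, gamma):
        """Local off-policy TD update (TDC/GTD-style, importance-weighted).

        rho = pi(a|s) / b_k(a|s) corrects for the mismatch between the
        common target policy pi and this agent's behavior policy b_k.
        """
        delta = reward + gamma * (phi_next @ self.w) - phi @ self.w  # TD error
        self.theta += self.eta * rho * (delta - phi @ self.theta) * phi
        self.w += self.mu * rho * (delta * phi
                                   - gamma * (phi @ self.theta) * phi_next)
        return self.w

def combine(agents, A):
    """Diffusion combination step: each agent replaces its intermediate
    estimate by a convex average of its neighbors' estimates, where
    A[l, k] is the weight agent k assigns to neighbor l (columns sum to 1)."""
    W = np.stack([a.w for a in agents], axis=1)  # dim x N matrix of estimates
    W_new = W @ A                                # neighborhood averaging
    for k, a in enumerate(agents):
        a.w = W_new[:, k]
```

In an adapt-then-combine sweep, every agent first calls adapt on its own locally sampled transition and then combine mixes the intermediate estimates over the network. Because only neighbors exchange vectors of the feature dimension, each iteration remains linear in computation and memory, and the combination step is what allows the network estimate to cover regions of the state space that no single behavior policy visits.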
