Proposal: Approximate Dynamic Programming Using Bellman Residual Elimination

The overarching goal of this thesis is to devise new strategies for multi-agent planning and control problems, especially when the agents are subject to random failures, maintenance needs, or other health management concerns, or when the system model is not perfectly known. We argue that dynamic programming techniques, in particular Markov Decision Processes (MDPs), are a natural framework for these planning problems, and we present an MDP formulation of a persistent surveillance mission that incorporates stochastic fuel usage dynamics and the possibility of randomly occurring failures into the planning process. We show that this formulation and its optimal policy lead to good mission performance in a number of real-world scenarios. We also develop an online, adaptive solution framework that allows the planning system to improve its performance over time, even when the true system model is uncertain or time-varying. Motivated by the difficulty of solving the persistent surveillance problem exactly as the number of agents grows, we then develop a new family of approximate dynamic programming algorithms, called Bellman Residual Elimination (BRE) methods, which can be used to approximately solve large-scale MDPs. We analyze these methods and prove a number of desirable theoretical properties, including reduction to exact policy iteration under certain conditions. Finally, we apply BRE to large-scale persistent surveillance problems and show that it yields good performance and integrates successfully into the adaptive planning framework.
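
To make the central idea behind BRE concrete, consider evaluating a fixed policy $\pi$ with stage cost $g$, discount factor $\alpha \in (0,1)$, and transition model $P$. The Bellman residual of an approximate cost-to-go function $\widetilde{V}$ at a state $s$ is

$$BR(s) = \widetilde{V}(s) - \Big( g(s) + \alpha \sum_{s'} P(s' \mid s, \pi(s))\, \widetilde{V}(s') \Big).$$

Classical Bellman residual methods minimize the sum of squared residuals over a set of sample states; BRE instead constructs $\widetilde{V}$ so that the residual is exactly zero at every sample state, which is the mechanism behind the reduction to exact policy iteration mentioned above. The numpy sketch below illustrates this mechanism on a small random MDP. It is a hedged toy under stated assumptions, not the algorithm developed in the thesis: it works with state values rather than Q-functions, fixes a Gaussian kernel on state indices, and all variable names are illustrative.

```python
import numpy as np

# Toy sketch of the Bellman Residual Elimination idea for policy
# evaluation: choose kernel weights so the Bellman residual vanishes
# exactly at a set of sample states (illustrative only).

n, alpha = 6, 0.9
rng = np.random.default_rng(0)

P = rng.random((n, n))                # transitions under the fixed policy
P /= P.sum(axis=1, keepdims=True)     # make each row a distribution
g = rng.random(n)                     # per-state stage cost

def kernel(s, t, width=2.0):
    # Gaussian kernel on state indices (an illustrative choice).
    return np.exp(-((s - t) ** 2) / (2.0 * width ** 2))

states = np.arange(n)
samples = np.array([0, 2, 4])         # sampled states on which BR(s) = 0

K = kernel(samples[:, None], samples[None, :])      # k(s_i, s_j)
K_full = kernel(states[:, None], samples[None, :])  # k(s', s_j) for all s'

# Zero-residual condition at each sample s_i:
#   sum_j lam_j k(s_i, s_j) = g(s_i) + alpha * E[ sum_j lam_j k(s', s_j) ]
# which is linear in the weights lam, so it reduces to one linear solve
# (assumes the system matrix is nonsingular).
A = K - alpha * (P[samples] @ K_full)
lam = np.linalg.solve(A, g[samples])

V = K_full @ lam                      # V~(s) = sum_j lam_j k(s, s_j)
residuals = V - (g + alpha * P @ V)   # Bellman residual at every state

print(residuals[samples])             # ~0 (machine precision) by construction
print(residuals)                      # generally nonzero at unsampled states
```

If the sample set is taken to be the entire state space and the kernel matrix is nonsingular, the linear system enforces $\widetilde{V} = g + \alpha P \widetilde{V}$ exactly, recovering exact policy evaluation; this is the value-function analogue of the reduction-to-exact-policy-iteration property claimed for BRE.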
