Parallel Nonstationary Direct Policy Search for Risk-Averse Stochastic Optimization

This paper presents an algorithmic strategy for nonstationary policy search in finite-horizon, discrete-time Markov decision problems with large state spaces, constrained action sets, and a risk-sensitive optimality criterion. The methodology models time-varying policy parameters with a nonparametric response surface for an indirectly parametrized policy motivated by Bellman's equation. The policy structure is heuristic when optimization of the risk-sensitive criterion does not admit a dynamic programming reformulation. Through the interpolating approximation, the degree of nonstationarity of the policy, and consequently the size of the resulting search problem, can be adjusted. The computational tractability and generality of the approach follow from a nested parallel implementation of derivative-free optimization combined with Monte Carlo simulation. We demonstrate the efficiency of the approach on an optimal energy storage charging problem and illustrate the effect of the...
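
To make the strategy concrete, below is a minimal, self-contained Python sketch of the pipeline the abstract describes: a threshold policy whose time-varying parameter is interpolated from a small number of knots, a mean-CVaR criterion estimated by Monte Carlo simulation, and a derivative-free search over the knot values. The storage dynamics, price model, CVaR level, and the use of SciPy's Nelder-Mead in place of the paper's nested parallel pattern search are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

T = 24           # decision horizon (hours); illustrative
K = 4            # interpolation knots controlling the policy's nonstationarity
N_PATHS = 500    # Monte Carlo sample paths
ALPHA = 0.95     # CVaR confidence level; illustrative
LAM = 0.5        # weight on CVaR in the mean-CVaR criterion; illustrative

# Fix the sample paths up front (common random numbers) so the
# derivative-free search sees a deterministic objective.
rng = np.random.default_rng(0)
PRICES = 50.0 + 10.0 * rng.standard_normal((N_PATHS, T))  # toy price model

def thresholds(theta):
    """Interpolate K knot values into a per-period threshold.

    K tunes the degree of nonstationarity: K = 1 recovers a stationary
    policy, K = T a fully time-varying one.
    """
    knots = np.linspace(0.0, T - 1, K)
    return np.interp(np.arange(T), knots, theta)

def rollout_cost(theta, prices):
    """Simulate a buy-low/sell-high threshold policy on one price path."""
    thr = thresholds(theta)
    soc, cost = 0.5, 0.0                         # state of charge in [0, 1]
    for t in range(T):
        if prices[t] < thr[t] and soc <= 0.9:    # charge: pay the price
            soc += 0.1
            cost += 0.1 * prices[t]
        elif prices[t] > thr[t] and soc >= 0.1:  # discharge: earn revenue
            soc -= 0.1
            cost -= 0.1 * prices[t]
    return cost

def mean_cvar(theta):
    """Monte Carlo estimate of the mean-CVaR objective (cost: lower is better)."""
    costs = np.array([rollout_cost(theta, p) for p in PRICES])
    var = np.quantile(costs, ALPHA)
    cvar = costs[costs >= var].mean()            # expected cost in the worst tail
    return (1.0 - LAM) * costs.mean() + LAM * cvar

# Nelder-Mead stands in here for the paper's parallel pattern search; both
# are derivative-free, so only objective evaluations are needed.
res = minimize(mean_cvar, x0=np.full(K, 50.0), method="Nelder-Mead")
print("knot values:", np.round(res.x, 2), " objective:", round(res.fun, 3))
```

Fixing the sample paths before the search (common random numbers) keeps the simulated objective deterministic across evaluations, which direct search methods generally require; in the nested parallel setting, the inner Monte Carlo rollouts and the outer trial-point evaluations are the natural units to distribute across workers.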
