Multi-objective decision-theoretic planning
[1] R. Bellman. A Markovian Decision Process, 1957.
[2] A. S. Manne. Linear Programming and Sequential Decisions, 1960.
[3] Karl Johan Åström, et al. Optimal control of Markov processes with incomplete state information, 1965.
[4] A. Sen, et al. Collective Choice and Social Welfare, 2017.
[5] P. McMullen. The maximum numbers of faces of a convex polytope, 1970.
[6] E. J. Sondik, et al. The Optimal Control of Partially Observable Markov Decision Processes, 1971.
[7] Ronald L. Graham, et al. An Efficient Algorithm for Determining the Convex Hull of a Finite Planar Set, 1972, Inf. Process. Lett.
[8] Ray A. Jarvis, et al. On the Identification of the Convex Hull of a Finite Set of Points in the Plane, 1973, Inf. Process. Lett.
[9] Arnon Rosenthal. Nonserial dynamic programming is optimal, 1977, STOC '77.
[10] G. Monahan. State of the Art—A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms, 1982.
[11] D. White. Multi-objective infinite-horizon discounted Markov decision processes, 1982.
[12] Stefan Arnborg, et al. Efficient algorithms for combinatorial problems on graphs with bounded decomposability — A survey, 1985, BIT.
[13] Judea Pearl, et al. Probabilistic reasoning in intelligent systems - networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.
[14] Hsien-Te Cheng, et al. Algorithms for partially observable Markov decision processes, 1989.
[15] Leslie Pack Kaelbling, et al. Acting Optimally in Partially Observable Stochastic Domains, 1994, AAAI.
[16] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[17] Peter Norvig, et al. Artificial Intelligence: A Modern Approach, 1995.
[18] Daniel P. Miranker, et al. On the Space-Time Trade-off in Solving Constraint Satisfaction Problems, 1995, IJCAI.
[19] Arthur C. Graesser, et al. Is it an Agent, or Just a Program?: A Taxonomy for Autonomous Agents, 1996, ATAL.
[20] Anders R. Kristensen, et al. Dynamic programming and Markov decision processes, 1996.
[21] Rina Dechter, et al. Bucket elimination: A unifying framework for probabilistic inference, 1996, UAI.
[22] Craig Boutilier, et al. Planning, Learning and Coordination in Multiagent Decision Processes, 1996, TARK.
[23] Robert T. Clemen, et al. Making Hard Decisions: An Introduction to Decision Analysis, 1997.
[24] Michael L. Littman, et al. Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes, 1997, UAI.
[25] G. Ziegler, et al. Basic properties of convex polytopes, 1997.
[26] John N. Tsitsiklis, et al. Introduction to linear optimization, 1997, Athena Scientific optimization and computation series.
[27] Leslie Pack Kaelbling, et al. Planning and Acting in Partially Observable Stochastic Domains, 1998, Artif. Intell.
[28] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.
[29] A. Cassandra, et al. Exact and approximate algorithms for partially observable Markov decision processes, 1998.
[30] S. Mahadevan, et al. Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning, 1999.
[31] Craig Boutilier, et al. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage, 1999, J. Artif. Intell. Res.
[32] Jesse Hoey, et al. SPUDD: Stochastic Planning using Decision Diagrams, 1999, UAI.
[33] Neil Immerman, et al. The Complexity of Decentralized Control of Markov Decision Processes, 2000, UAI.
[34] Luc Devroye, et al. Estimating the number of vertices of a polyhedron, 2000, Inf. Process. Lett.
[35] Milos Hauskrecht, et al. Value-Function Approximations for Partially Observable Markov Decision Processes, 2000, J. Artif. Intell. Res.
[36] Eitan Altman, et al. Applications of Markov Decision Processes in Communication Networks, 2000.
[37] Carlos Guestrin, et al. Multiagent Planning with Factored MDPs, 2001, NIPS.
[38] Anne Condon, et al. On the undecidability of probabilistic planning and related stochastic optimization problems, 2003, Artif. Intell.
[39] Marc E. Pfetsch, et al. Some Algorithmic Problems in Polytope Theory, 2003, Algebra, Geometry, and Software Systems.
[40] Marco Laumanns, et al. Performance assessment of multiobjective optimizers: an analysis and review, 2003, IEEE Trans. Evol. Comput.
[41] Claudia V. Goldman, et al. Transition-independent decentralized Markov decision processes, 2003, AAMAS '03.
[42] Nikos A. Vlassis, et al. Sparse cooperative Q-learning, 2004, ICML.
[43] Nikos A. Vlassis, et al. Anytime algorithms for multiagent decision making using coordination graphs, 2004, IEEE International Conference on Systems, Man and Cybernetics.
[44] Shlomo Zilberstein, et al. Region-Based Incremental Pruning for POMDPs, 2004, UAI.
[45] Patrice Perny, et al. GAI Networks for Utility Elicitation, 2004, KR.
[46] Sean R. Eddy, et al. What is dynamic programming?, 2004, Nature Biotechnology.
[47] Nikos A. Vlassis, et al. Perseus: Randomized Point-based Value Iteration for POMDPs, 2005, J. Artif. Intell. Res.
[48] Makoto Yokoo, et al. Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs, 2005, IJCAI.
[49] Rina Dechter, et al. The Relationship Between AND/OR Search and Variable Elimination, 2005, UAI.
[50] I. Y. Kim, et al. Adaptive weighted-sum method for bi-objective optimization: Pareto front generation, 2005.
[51] David Furcy, et al. Limited Discrepancy Beam Search, 2005, IJCAI.
[52] John K. Slaney, et al. Decision-Theoretic Planning with non-Markovian Rewards, 2011, J. Artif. Intell. Res.
[53] Javier Larrosa, et al. Bucket elimination for multiobjective optimization problems, 2006, J. Heuristics.
[54] Nikos A. Vlassis, et al. Collaborative Multiagent Reinforcement Learning by Payoff Propagation, 2006, J. Mach. Learn. Res.
[55] D. Bergemann, et al. Efficient Dynamic Auctions, 2006.
[56] Joelle Pineau, et al. Anytime Point-Based Approximations for Large POMDPs, 2006, J. Artif. Intell. Res.
[57] László Monostori, et al. Agent-based systems for manufacturing, 2006.
[58] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning, 2006, ECML.
[59] Rina Dechter, et al. AND/OR search spaces for graphical models, 2007, Artif. Intell.
[60] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.
[61] David Levine, et al. Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning, 2007, NIPS.
[62] Srini Narayanan, et al. Learning all optimal policies with multiple criteria, 2008, ICML '08.
[63] Bart De Schutter, et al. A Comprehensive Survey of Multiagent Reinforcement Learning, 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).
[64] David Hsu, et al. SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces, 2008, Robotics: Science and Systems.
[65] Rina Dechter, et al. AND/OR search strategies for combinatorial optimization in graphical models, 2008.
[66] Michael I. Jordan, et al. Graphical Models, Exponential Families, and Variational Inference, 2008, Found. Trends Mach. Learn.
[67] Ruggiero Cavallo, et al. Efficiency and redistribution in dynamic mechanism design, 2008, EC '08.
[68] Emma Rollón, et al. Multi-objective optimization in graphical models, 2008.
[69] Patrice Perny, et al. Multiobjective Optimization using GAI Models, 2009, IJCAI.
[70] Hisashi Handa. Solving Multi-objective Reinforcement Learning Problems by EDA-RL - Acquisition of Various Strategies, 2009, Ninth International Conference on Intelligent Systems Design and Applications.
[71] Andrei V. Kelarev, et al. Constructing Stochastic Mixture Policies for Episodic Multiobjective Reinforcement Learning Tasks, 2009, Australasian Conference on Artificial Intelligence.
[72] Nir Friedman, et al. Probabilistic Graphical Models - Principles and Techniques, 2009.
[73] Patrice Perny, et al. Choquet Optimization Using GAI Networks for Multiagent/Multicriteria Decision-Making, 2009, ADT.
[74] Hisashi Handa. EDA-RL: estimation of distribution algorithms for reinforcement learning problems, 2009, GECCO '09.
[75] Radu Marinescu, et al. Exploiting Problem Decomposition in Multi-objective Constraint Optimization, 2009, CP.
[76] Susan A. Murphy, et al. Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial Analysis, 2010, ICML.
[77] Edmund H. Durfee, et al. Influence-Based Policy Abstraction for Weakly-Coupled Dec-POMDPs, 2010, ICAPS.
[78] Frans A. Oliehoek, et al. Value-Based Planning for Teams of Agents in Stochastic Partially Observable Environments, 2010.
[79] David Hsu, et al. Planning under Uncertainty for Robotic Tasks with Mixed Observability, 2010, Int. J. Robotics Res.
[80] Peter Auer, et al. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem, 2010, Period. Math. Hung.
[81] Sven Koenig, et al. BnB-ADOPT: an asynchronous branch-and-bound DCOP algorithm, 2008, AAMAS.
[82] Radu Marinescu. Efficient Approximation Algorithms for Multi-objective Constraint Optimization, 2011, ADT.
[83] Qiang Liu, et al. Bounding the Partition Function using Hölder's Inequality, 2011, ICML.
[84] D. Pardoe. Adaptive trading agent strategies using market experience, 2011.
[85] Yiannis Demiris, et al. Evolving policies for multi-reward partially observable Markov decision processes (MR-POMDPs), 2011, GECCO '11.
[86] Tommi S. Jaakkola, et al. Introduction to dual decomposition for inference, 2011.
[87] Qiang Liu, et al. Variational algorithms for marginal MAP, 2011, J. Mach. Learn. Res.
[88] Nicholas R. Jennings, et al. Bounded decentralised coordination over multiple objectives, 2011, AAMAS.
[89] Yiannis Demiris, et al. Multi-reward policies for medical applications: anthrax attacks and smart wheelchairs, 2011, GECCO.
[90] Kee-Eung Kim, et al. Closing the Gap: Improved Bounds on Optimal POMDP Solutions, 2011, ICAPS.
[91] Guy Shani, et al. A Survey of Point-Based POMDP Solvers, 2013, Auton. Agents Multi Agent Syst.
[92] Istvan Szita, et al. Reinforcement Learning in Games, 2012, Reinforcement Learning.
[93] Thomas Keller, et al. PROST: Probabilistic Planning Based on UCT, 2012, ICAPS.
[94] Nic Wilson, et al. Multi-objective Influence Diagrams, 2012, UAI.
[95] Lars Otten, et al. Join-graph based cost-shifting schemes, 2012, UAI.
[96] Shie Mannor, et al. Bayesian Reinforcement Learning, 2012, Reinforcement Learning.
[97] Peter Vrancx, et al. Reinforcement Learning: State-of-the-Art, 2012.
[98] Hado van Hasselt, et al. Reinforcement Learning in Continuous State and Action Spaces, 2012, Reinforcement Learning.
[99] Bo An, et al. Multi-objective optimization for security games, 2012, AAMAS.
[100] Matthijs T. J. Spaan, et al. Partially Observable Markov Decision Processes, 2010, Encyclopedia of Machine Learning.
[101] Shimon Whiteson, et al. Exploiting Structure in Cooperative Bayesian Games, 2012, UAI.
[102] Shimon Whiteson, et al. Computing Convex Coverage Sets for Multi-objective Coordination Graphs, 2013, ADT.
[103] Charles L. Isbell, et al. Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs, 2013, NIPS.
[104] Patrice Perny, et al. Approximation of Lorenz-Optimal Solutions in Multiobjective Markov Decision Processes, 2013, AAAI.
[105] Malte Helmert, et al. Trial-Based Heuristic Tree Search for Finite Horizon MDPs, 2013, ICAPS.
[106] Ashutosh Nayyar, et al. Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach, 2012, IEEE Transactions on Automatic Control.
[107] Jan Peters, et al. Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.
[108] Ann Nowé, et al. Designing multi-objective multi-armed bandits algorithms: A study, 2013, International Joint Conference on Neural Networks (IJCNN).
[109] Wolfgang Ketter, et al. Autonomous Agents in Future Energy Markets: The 2012 Power Trading Agent Competition, 2013, AAAI.
[110] Shimon Whiteson, et al. Multi-objective variable elimination for collaborative graphical games, 2013, AAMAS.
[111] Rina Dechter. Reasoning with Probabilistic and Deterministic Graphical Models: Exact Algorithms, 2013.
[112] Shimon Whiteson, et al. A Survey of Multi-Objective Sequential Decision-Making, 2013, J. Artif. Intell. Res.
[113] Mathijs de Weerdt, et al. Planning under Uncertainty for Coordinating Infrastructural Maintenance, 2013, ICAPS.
[114] Olivier Buffet, et al. Optimally Solving Dec-POMDPs as Continuous-State MDPs, 2013, IJCAI.
[115] Leo van Moergestel, et al. Agent Technology in Agile Multiparallel Manufacturing and Product Support, 2014.
[116] Shimon Whiteson, et al. Queued Pareto Local Search for Multi-Objective Optimization, 2014, PPSN.
[117] Shimon Whiteson, et al. Linear support for multi-objective coordination graphs, 2014, AAMAS.
[118] Bernard Manderick, et al. The scalarized multi-objective multi-armed bandit problem: An empirical study of its exploration vs. exploitation tradeoff, 2014, International Joint Conference on Neural Networks (IJCNN).
[119] Peter R. Lewis, et al. A novel adaptive weight selection algorithm for multi-objective multi-agent reinforcement learning, 2014, International Joint Conference on Neural Networks (IJCNN).
[120] Marco Wiering, et al. Model-based multi-objective reinforcement learning, 2014, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).
[121] Shimon Whiteson, et al. Bounded Approximations for Linear Multi-Objective Planning Under Uncertainty, 2014, ICAPS.
[122] Ann Nowé, et al. Multi-objective reinforcement learning using sets of Pareto dominating policies, 2014, J. Mach. Learn. Res.
[123] Frans A. Oliehoek, et al. Dec-POMDPs as Non-Observable MDPs, 2014.
[124] Doina Precup, et al. Algorithms for multi-armed bandit problems, 2014, ArXiv.
[125] Shimon Whiteson, et al. Computing Convex Coverage Sets for Faster Multi-objective Coordination, 2015, J. Artif. Intell. Res.
[126] Shimon Whiteson, et al. Point-Based Planning for Multi-Objective POMDPs, 2015, IJCAI.
[127] Frans A. Oliehoek, et al. Structure in the value function of zero-sum games of incomplete information, 2015.
[128] Shimon Whiteson. Pareto Local Policy Search for MOMDP Planning, 2015.
[129] Frans A. Oliehoek, et al. Quality Assessment of MORL Algorithms: A Utility-Based Approach, 2015.
[130] Nic Wilson, et al. Computing Possibly Optimal Solutions for Multi-Objective Constraint Optimisation with Tradeoffs, 2015, IJCAI.
[131] Diederik M. Roijers. Variational Multi-Objective Coordination, 2015.
[132] Patrice Perny, et al. Incremental Weight Elicitation for Multiobjective State Space Search, 2015, AAAI.
[133] Shlomo Zilberstein, et al. Multi-Objective POMDPs with Lexicographic Reward Preferences, 2015, IJCAI.
[134] Frans A. Oliehoek, et al. Factored Upper Bounds for Multiagent Planning Problems under Uncertainty with Non-Factored Value Functions, 2015, IJCAI.
[135] Mathijs de Weerdt, et al. Solving Multi-agent MDPs Optimally with Conditional Return Graphs, 2015.
[136] Mathijs de Weerdt, et al. Solving Transition-Independent Multi-Agent MDPs with Sparse Interactions, 2015, AAAI.
[137] Juliane Hahn, et al. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned, 2016.
[138] Shimon Whiteson, et al. Multi-Objective Decision Making, 2017, Synthesis Lectures on Artificial Intelligence and Machine Learning.