State Space Reduction For Hierarchical Reinforcement Learning

Abstract

This paper provides new techniques for abstracting the state space of a Markov Decision Process (MDP). These techniques extend one of the recent minimization models, known as ε-reduction, to construct a partition space that has a smaller number of states than the original MDP. As a result, learning policies on the partition space should be faster than on the original state space. The technique presented here extends ε-reduction to SMDPs by executing a policy instead of a single action, and by grouping all states that have a small difference in transition probabilities and reward function under a given policy. When the reward structure is not known, a two-phase method for state aggregation is introduced, and a theorem in this paper shows the solvability of tasks using the partitions produced by the two-phase method. These partitions can be further refined when the complete reward structure is available. Simulations on different state spaces show that policies learned on the original MDP and on this representation achieve similar results, while the total learning time on the partition space of the presented approach is much smaller than the total time spent learning on the original state space.

Introduction

Markov decision processes (MDPs) are useful ways to model stochastic environments, as there are well-established algorithms to solve these models. Even though these algorithms find an optimal solution for the model, they suffer from high time complexity when the number of decision points is large (Parr 1998; Dietterich 2000). To address increasingly complex problems, a number of approaches have been used to design state space representations that increase the efficiency of learning (Dean, Kaelbling & Nicholson 1995; Dean & Givan 1997). Here, particular features are hand-designed based on the task domain and the capabilities of the learning agent. In autonomous systems, however, this is generally a difficult task, since it is hard to anticipate which parts of the underlying physical state are important for the given decision-making problem. Moreover, in hierarchical learning approaches the required information might change over time as increasingly competent actions become available. The same can be observed in biological systems, where information about all muscle fibers is initially instrumental in generating strategies for coordinated movement. However, as such strategies become established and ready to be used, this low-level information no longer has to be consciously taken into account.

The methods presented here build on the ε-reduction technique developed by Dean et al. (Givan & Thomas 1995) to derive representations in the form of state space partitions that ensure that the utility of a policy learned in the reduced state space is within a fixed bound of that of the optimal policy. The presented methods extend the ε-reduction technique by including policies as actions, thus using it to find approximate SMDP reductions. Furthermore, they derive partitions for individual actions and compose them into representations for any given subset of the action space. This is further extended by permitting a two-phase partitioning that is initially reward independent and can later be refined once the reward function is known.
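To make the grouping criterion concrete, the following is a minimal Python sketch of an ε-reduction style splitting loop: states are kept in the same block only if, for every action, their rewards and their aggregate transition probabilities into each block of the current partition differ by at most ε. The names (epsilon_reduce, block_probability, the nested-dictionary encoding of P and R) are illustrative assumptions, not taken from the paper, and the paper's actual construction, including its use of policies in place of primitive actions, may differ.

```python
# Minimal sketch of epsilon-reduction style partitioning (hypothetical names).
# P[s][a] is a dict mapping successor states to probabilities; R[s][a] is a reward.

def block_probability(P, s, a, block):
    """Probability of moving from s into any state of `block` under action a."""
    return sum(P[s][a].get(s2, 0.0) for s2 in block)

def split_block(block, P, R, actions, partition, eps):
    """Greedily split one block into epsilon-homogeneous sub-blocks."""
    subblocks = []
    for s in block:
        placed = False
        for sub in subblocks:
            rep = sub[0]  # representative state of the candidate sub-block
            close = all(
                abs(R[s][a] - R[rep][a]) <= eps and
                all(abs(block_probability(P, s, a, b) -
                        block_probability(P, rep, a, b)) <= eps
                    for b in partition)
                for a in actions)
            if close:
                sub.append(s)
                placed = True
                break
        if not placed:
            subblocks.append([s])
    return subblocks

def epsilon_reduce(states, actions, P, R, eps):
    """Refine the trivial partition until it is stable under splitting."""
    partition = [list(states)]
    while True:
        new_partition = []
        for block in partition:
            new_partition.extend(split_block(block, P, R, actions, partition, eps))
        if len(new_partition) == len(partition):  # no block was split: fixed point
            return new_partition
        partition = new_partition
```

With P and R encoded as above, epsilon_reduce(states, actions, P, R, 0.1) returns a list of state blocks that can serve as the abstract states; larger values of ε yield coarser partitions at the cost of a looser bound on the quality of the learned policy.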
In particular, the techniques described in the following subsections extend ε-reduction (Thomas Dean & Leach 1997) by introducing the following methods:

• Temporal abstraction
• Action dependent decomposition
• Two-phase decomposition

Formalism

A Markov decision process (MDP) is a 4-tuple (S, A, P, R), where S is the set of states, A is the set of actions available in each state, P is a transition probability function that assigns a value 0 ≤ p ≤ 1 to each state-action-successor triple, and R is the reward function. The transition function is a map $P : S \times A \times S \to [0, 1]$, usually written $P(s' \mid s, a)$, the probability that executing action a in state s leads to state s'. Similarly, the reward function is a map $R : S \times A \to \mathbb{R}$, and $R(s, a)$ denotes the reward gained by executing action a in state s. Any policy π defines a value function, and the Bellman equation (Bellman 1957; Puterman 1994) relates the value of each state to the values of the other states by

\[
V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^{\pi}(s').
\]

Previous Work

State space reduction methods use the basic concepts of an MDP, such as the transition probabilities and the reward function, to represent a large class of states with a single state of the abstract space. The most important issues establishing that the generated abstraction is a valid approximate MDP are:

1. The difference between the transition functions and between the reward functions of the two models has to be small.
2. For each policy on the original state space there must exist a policy in the abstract model. Moreover, if a state s' is not reachable from state s in the abstract model, then there should not exist a policy that leads from s to s' in the original state space.

SMDPs

One approach to treating temporal abstraction is to use the theory of semi-Markov decision processes (SMDPs). The actions in SMDPs take a variable amount of time and are intended to model temporally extended actions, represented as sequences of primary actions.

Policies: A policy (option) in SMDPs is a triple $o_i = (I_i, \pi_i, \beta_i)$ (Boutilier & Hanks 1995), where $I_i$ is an initiation set, $\pi_i : S \times A \to [0, 1]$ is a primary policy, and $\beta_i : S \to [0, 1]$ is a termination condition. When a policy $o_i$ is executed, actions are chosen according to $\pi_i$ until the policy terminates stochastically according to $\beta_i$. The initiation set and termination condition of a policy limit the range over which the policy needs to be defined and determine its termination. Given any set of multi-step actions, we consider policies over those actions; in this case we need to generalize the definition of the value function. The value of a state s under an SMDP policy π is defined as (Boutilier & Goldszmidt 1994):
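A minimal sketch of this quantity, assuming the standard options-framework form (the exact expression used in the paper may differ):

\[
% reconstruction under the options-framework assumption, not the paper's own equation
V^{\pi}(s) = E\!\left[\, r_{t+1} + \gamma\, r_{t+2} + \cdots + \gamma^{\,k-1} r_{t+k} + \gamma^{\,k}\, V^{\pi}(s_{t+k}) \;\middle|\; \pi \text{ initiated in } s \text{ at time } t \,\right],
\]

where k is the random number of primitive steps taken by the option that π selects in s, and $s_{t+k}$ is the state in which that option terminates.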

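Operationally, executing an option $o_i = (I_i, \pi_i, \beta_i)$ means following $\pi_i$ one primitive action at a time until $\beta_i$ terminates it. The sketch below illustrates this; env_step, pi and beta are hypothetical callables (an environment transition returning the next state and reward, a stochastic policy returning an action distribution, and a termination probability), none of which come from the paper.

```python
import random

def execute_option(env_step, state, initiation_set, pi, beta, gamma=0.95):
    """Follow the option's policy pi from `state` until beta terminates it.

    Returns the discounted reward accumulated while the option ran, the state
    in which it terminated, and the number of primitive steps taken.
    """
    assert state in initiation_set, "an option may only be initiated in its initiation set"
    total_reward, discount, steps = 0.0, 1.0, 0
    while True:
        probs = pi(state)                                        # action distribution pi_i(state, .)
        action = random.choices(list(probs), weights=list(probs.values()))[0]
        state, reward = env_step(state, action)                  # one primitive transition
        total_reward += discount * reward
        discount *= gamma
        steps += 1
        if random.random() < beta(state):                        # terminate with probability beta_i(state)
            return total_reward, state, steps
```

The accumulated discounted reward and the number of steps returned here are exactly the quantities needed to back up SMDP values over multi-step actions.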
References

[1]  佐藤 保,et al.  Principal Components , 2021, Encyclopedic Dictionary of Archaeology.

[2]  E. Mark Gold,et al.  Language Identification in the Limit , 1967, Inf. Control..

[3]  C. N. Liu,et al.  Approximating discrete probability distributions with dependence trees , 1968, IEEE Trans. Inf. Theory.

[4]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[5]  Kathleen Knobe,et al.  A Method for Inferring Context-free Grammars , 1976, Inf. Control..

[6]  L. R. Rabiner,et al.  Some properties of continuous hidden Markov model representations , 1985, AT&T Technical Journal.

[7]  Biing-Hwang Juang,et al.  Maximum likelihood estimation for multivariate mixture observations of markov chains , 1986, IEEE Trans. Inf. Theory.

[8]  N. Littlestone Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[9]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[10]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[11]  Colin Giles,et al.  Learning Context-free Grammars: Capabilities and Limitations of a Recurrent Neural Network with an External Stack Memory , 1992 .

[12]  Pat Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[13]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[14]  Susan T. Dumais,et al.  LSI meets TREC: A Status Report , 1992, TREC.

[15]  D. Wolpert On Overfitting Avoidance as Bias , 1993 .

[16]  Leslie G. Valiant,et al.  Cryptographic Limitations on Learning Boolean Formulae and Finite Automata , 1993, Machine Learning: From Theory to Applications.

[17]  Susan T. Dumais,et al.  Latent Semantic Indexing (LSI): TREC-3 Report , 1994, TREC.

[18]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[19]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[20]  Wai Lam,et al.  LEARNING BAYESIAN BELIEF NETWORKS: AN APPROACH BASED ON THE MDL PRINCIPLE , 1994, Comput. Intell..

[21]  Pat Langley,et al.  Induction of Selective Bayesian Classifiers , 1994, UAI.

[22]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev..

[23]  Michael J. Pazzani,et al.  Searching for Dependencies in Bayesian Classifiers , 1995, AISTATS.

[24]  Leslie Pack Kaelbling,et al.  Planning under Time Constraints in Stochastic Domains , 1993, Artif. Intell..

[25]  M.W. Berry,et al.  Computational Methods for Intelligent Information Access , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[26]  Craig Boutilier,et al.  Exploiting Structure in Policy Construction , 1995, IJCAI.

[27]  Susan T. Dumais Combining evidence for effective information filtering , 1996 .

[28]  Pedro M. Domingos,et al.  Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier , 1996, ICML.

[29]  Vittorio Castelli,et al.  The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter , 1996, IEEE Trans. Inf. Theory.

[30]  Padhraic Smyth,et al.  Clustering Sequences with Hidden Markov Models , 1996, NIPS.

[31]  George K. Kokkinakis,et al.  Algorithm for clustering continuous density HMM by recognition error , 1996, IEEE Trans. Speech Audio Process..

[32]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[33]  T. Dean,et al.  Planning under uncertainty: structural assumptions and computational leverage , 1996 .

[34]  Tomás Lozano-Pérez,et al.  A Framework for Multiple-Instance Learning , 1997, NIPS.

[35]  Harris Drucker,et al.  Improving Regressors using Boosting Techniques , 1997, ICML.

[36]  Robert Givan,et al.  Model Minimization in Markov Decision Processes , 1997, AAAI/IAAI.

[37]  Robert Givan,et al.  Bounded Parameter Markov Decision Processes , 1997, ECP.

[38]  Weiru Liu,et al.  Learning belief networks from data: an information theory based approach , 1997, CIKM '97.

[39]  Robert Givan,et al.  Model Reduction Techniques for Computing Approximately Optimal Solutions for Markov Decision Processes , 1997, UAI.

[40]  Peter Auer,et al.  On Learning From Multi-Instance Examples: Empirical Evaluation of a Theoretical Approach , 1997, ICML.

[41]  Luc De Raedt,et al.  Attribute-Value Learning Versus Inductive Logic Programming: The Missing Links (Extended Abstract) , 1998, ILP.

[42]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[43]  Christos H. Papadimitriou,et al.  Elements of the Theory of Computation , 1997, SIGA.

[44]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[45]  Oded Maron,et al.  Multiple-Instance Learning for Natural Scene Classification , 1998, ICML.

[46]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[47]  Amir B. Geva,et al.  Brain state identification and forecasting of acute pathology using unsupervised fuzzy clustering of , 2017 .

[48]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[49]  Leonard E. Trigg,et al.  Naive Bayes for regression , 1998 .

[50]  Gregory F. Cooper,et al.  A Bayesian Network Classifier that Combines a Finite Mixture Model and a NaIve Bayes Model , 1999, UAI.

[51]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[52]  Dan Roth,et al.  Learning in Natural Language , 1999, IJCAI.

[53]  R. Greiner,et al.  Comparing Bayesian Network Classifiers , 1999, UAI.

[54]  Thomas G. Dietterich An Overview of MAXQ Hierarchical Reinforcement Learning , 2000, SARA.

[55]  Kamal Nigam,et al.  Understanding the Behavior of Co-training , 2000, KDD 2000.

[56]  James Ze Wang,et al.  SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[57]  Paul N. Bennett Assessing the Calibration of Naive Bayes Posterior Estimates , 2000 .

[58]  Haym Hirsh,et al.  Improving Short Text Classification Using Unlabeled Background Knowledge , 2000, ICML 2000.

[59]  James T. Kwok,et al.  Rival penalized competitive learning for model-based sequence clustering , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[60]  S. Roberts,et al.  Estimation of coupled hidden Markov models with application to biosignal interaction modelling , 2000, Neural Networks for Signal Processing X. Proceedings of the 2000 IEEE Signal Processing Society Workshop (Cat. No.00TH8501).

[61]  Tong Zhang,et al.  The Value of Unlabeled Data for Classification Problems , 2000, ICML 2000.

[62]  Ying Wu,et al.  Self-Supervised Learning for Visual Tracking and Recognition of Human Hand , 2000, AAAI/IAAI.

[63]  Jesús Cid-Sueiro,et al.  An entropy minimization principle for semi-supervised terrain classification , 2000, Proceedings 2000 International Conference on Image Processing (Cat. No.00CH37101).

[64]  Avrim Blum,et al.  Learning from Labeled and Unlabeled Data using Graph Mincuts , 2001, ICML.

[65]  Sally A. Goldman,et al.  Multiple-Instance Learning of Real-Valued Data , 2001, J. Mach. Learn. Res..

[66]  Tom M. Mitchell,et al.  Using unlabeled data to improve text classification , 2001 .

[67]  Russell Greiner,et al.  Learning Bayesian Belief Network Classifiers: Algorithms and System , 2001, Canadian Conference on AI.

[68]  David Page,et al.  Multiple Instance Regression , 2001, ICML.

[69]  Haym Hirsh,et al.  Using LSI for text classification in the presence of background text , 2001, CIKM '01.

[70]  D. Hand,et al.  Idiot's Bayes—Not So Stupid After All? , 2001 .

[71]  Charles X. Ling,et al.  Learnability of Augmented Naive Bayes in Nominal Domains , 2001, ICML.

[72]  Qi Zhang,et al.  EM-DD: An Improved Multiple-Instance Learning Technique , 2001, NIPS.

[73]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[74]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[75]  Joydeep Ghosh,et al.  HMMs and Coupled HMMs for multi-channel EEG classification , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[76]  Haym Hirsh,et al.  Integrating Background Knowledge into Nearest-Neighbor Text Classification , 2002, ECCBR.

[77]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[78]  Cen Li,et al.  Applying the Hidden Markov Model Methodology for Unsupervised Learning of Temporal Data , 2002 .

[79]  Robert P. W. Duin,et al.  Bagging, Boosting and the Random Subspace Method for Linear Classifiers , 2002, Pattern Analysis & Applications.

[80]  Thorsten Joachims,et al.  Transductive Learning via Spectral Graph Partitioning , 2003, ICML.

[81]  Eugene Santos,et al.  Implicitly preserving semantics during incremental knowledge base acquisition under uncertainty , 2003, Int. J. Approx. Reason..

[82]  Joydeep Ghosh,et al.  A Unified Framework for Model-based Clustering , 2003, J. Mach. Learn. Res..

[83]  Stefan W. Christensen Ensemble Construction via Designed Output Distortion , 2003, Multiple Classifier Systems.

[84]  Bir Bhanu,et al.  A new semi-supervised EM algorithm for image retrieval , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[85]  Mehran Asadi State Space Reduction for Hierarchical Policy Formation , 2003 .

[86]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[87]  Nir Friedman,et al.  Bayesian Network Classifiers , 1997, Machine Learning.

[88]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[89]  Sean R Eddy,et al.  What is dynamic programming? , 2004, Nature Biotechnology.

[90]  D. Angluin Negative Results for Equivalence Queries , 1990, Machine Learning.

[91]  Jun Zhang,et al.  On Generalized Multiple-instance Learning , 2005, Int. J. Comput. Intell. Appl..