Implementing an Improved Reinforcement Learning Algorithm for the Simulation of Weekly Activity-Travel Sequences

ABSTRACT

Within the area of activity-based travel demand modeling, there is a growing tendency to enhance the realism of these models by incorporating dynamics based on learning and adaptation processes. The research presented here attempts to contribute to the current state of the art by formulating a framework for the simulation of individual activity-travel patterns. To this end, the current research redesigns an existing reinforcement learning technique by adding a regression-tree function approximator. This extension enables the Q-learning algorithm not only to consider more explanatory and decision variables, but also to handle these dimensions at a finer granularity. In addition, the reward function underlying the Q-learning process is drawn up carefully based on activity attributes rather than activity type. For the purpose of testing the applicability of the proposed improvements, a prototype model is implemented and applied to real-world data. The prototype model proves to learn weekly activity-travel patterns rather quickly, while requiring only a limited amount of memory. Additionally, in order to validate the suggested approach, the simulated weekly activity-travel sequences are compared to the observed ones by assessing their dissimilarity based on a number of distance measures.

INTRODUCTION

Since the introduction of activity-based travel demand modeling, several methods have been applied to forecast individual activity-travel behavior. Some more traditional techniques include logit and nested logit models (e.g. Day Activity Schedule (1) and PCATS (2)), Monte Carlo simulations (e.g. RAP (3) and SMASH (4)) and discrete choice models (e.g. CEMDAP (5) and MORPC (6)). More advanced computational methods, such as rule-based systems (e.g. ALBATROSS (7)), genetic algorithms (e.g. AURORA (8) and TASHA (9)) and reinforcement learning (e.g. (10)(11)(12)(13)), have been developed as well. However, the latter modeling approach in its most elementary form has proven to be insufficient for use within the current research area. (10) Therefore, this paper aims to contribute to the understanding and modeling of activity-travel sequences by:
• redesigning the simple reinforcement learning algorithm based on techniques originating from the area of artificial intelligence;
• developing a first prototype based on this improved reinforcement learning algorithm; and
• validating the applicability of this approach by applying the prototype to a small dataset.

First, the research problem will be introduced in the course of a brief literature overview. Then, the basic concepts of reinforcement learning will be discussed. Next, the reinforcement learning approach extended with a regression tree-based function approximator will be elaborated. Subsequently, the data underlying the empirical section will be described. The improved reinforcement learning method will then be applied to these data and the results will be presented. Conclusions and issues for future research can be found in the final section.

PROBLEM DESCRIPTION

The main assumption of activity-based travel demand models is that travel is derived from individual activity schedules. Indeed, individuals execute certain activities at certain locations in their attempt to achieve certain goals. To get to the desired locations, individuals need to travel.
Activity-based models thus focus on simultaneously predicting several activity-travel dimensions, such as the activity type, its duration and location, and the transport mode used to get to this location. The resulting activity-travel patterns constitute the basis of the assignment of individual routes to the transportation network when estimating aggregate travel demand. As a result, activity-based transportation models offer the opportunity of predicting travel demand more accurately, because they provide a more profound insight into individual activity-travel behavior. (7)(14)

Initially, activity-based modeling efforts focused on deriving models to schedule activity and travel episodes in order to match the observed activity-travel patterns, assuming a non-changing environment and fixed individual preferences. Nowadays, however, it is accepted that adaptation and learning need to be incorporated into the modeling framework, as individuals are part of an ever-changing environment. After all, interacting with this dynamic environment causes continuous adjustments of individual preferences, opinions and expectations. Consequently, individual decisions are prone to change, as they are taken conditionally upon previously gathered knowledge. To this purpose, dynamic activity-based models have been developed, in which individuals determine their activity-travel schedules dynamically by entering the transportation network simultaneously and interacting with each other. (14)(15) The modeling effort proposed here aims at capturing these dynamics by use of a reinforcement learning technique.

REINFORCEMENT LEARNING

This section only provides a brief overview of the core of reinforcement learning. A more comprehensive description of the reinforcement learning technique can be found in Sutton and Barto (16) and Smart and Kaelbling (17). Generally, a reinforcement learning problem attempts to find an optimal policy, that is, a rule for selecting in each state the action yielding the highest reward.

The reinforcement learning process can be summarized as follows. The individual or so-called agent first perceives the state of the environment, which is composed of a number of observable variables. Based on these observations, the agent chooses an action to be performed, which boils down to determining the values of a number of decision variables. The execution of the selected action causes changes in the state of the environment. As the agent continuously interacts with his environment, the agent perceives this state transition and values its benefit. This value can be either positive (reward) or negative (penalty). The agent then processes and memorizes the triplet containing the state, the action and the reward or penalty. Subsequently, the agent starts all over again, observing the state of the environment in order to select the next action. When selecting an action, the agent appeals to the stored triplets: when faced with a state similar to a previously encountered state, actions that have led to a reward will be reinforced, while actions associated with a penalty will be avoided. (16)

In the course of this learning process, the agent continuously trades off exploration of all feasible actions against exploitation of the knowledge gathered so far. To this end, an exploration parameter pexplore is defined, reflecting the probability of selecting a random action instead of the currently best one. (11)
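As an illustration of this exploration-exploitation trade-off, the following minimal Python sketch shows one common way such an action-selection rule can be implemented. The names select_action, q_values, feasible_actions and p_explore are illustrative assumptions introduced here, not part of the prototype described in this paper.

import random

def select_action(state, feasible_actions, q_values, p_explore):
    # With probability p_explore, explore by choosing a random feasible action;
    # otherwise exploit the Q-values gathered so far.
    if random.random() < p_explore:
        return random.choice(feasible_actions)
    # Exploitation: pick the feasible action with the highest stored Q-value;
    # state-action pairs that have not been visited yet default to 0.0.
    return max(feasible_actions, key=lambda a: q_values.get((state, a), 0.0))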
To allow for a real-world setting in which no perfect knowledge about the environment is available to the agent, the approach in the current research is founded on the model-free Q-learning technique, which also enables the agent to learn from delayed rewards. (16) A basic concept within Q-learning is the Q-value, which reflects the expected value of selecting action a in state s and following the optimal policy thereafter. A Q-value corresponds to a particular state-action pair (s,a) and can be decomposed into the value of the immediate reward (or penalty) R(s,a) and the value of the next state, discounted by a factor γ:

Q(s,a) = R(s,a) + γ · max_{a'} Q(s',a')    (1)

Additionally, the learning rate α has been introduced to enable incorporating previously gathered knowledge into the Q-value. The learning rate indicates the weight assigned to the value of the state-action pair (s,a) calculated according to equation (1), versus the Q-value Q_t(s,a) computed during a previous visit to the same state-action pair. The Q-value, now defined as Q_{t+1}(s,a), can then be rewritten as:

Q_{t+1}(s,a) = (1 − α) · Q_t(s,a) + α · [ R(s,a) + γ · max_{a'} Q_t(s',a') ]    (2)

In the course of the learning process, these Q-values are stored in a look-up table of which each entry corresponds to a combination of feasible values for all the dimensions of the state and the action. (16) The Q-learning algorithm can be outlined as follows:

Initialize Q-values.
Repeat N times (N = number of learning episodes):
    Select random state s0. Set s = s0.
    Repeat until the end of the learning episode:
        Select action a:
            Choose exploration parameter randomly.
            If exploration parameter > exploration rate pexplore:
                Choose best action:
                    Repeat for each action from the set of feasible actions A(s):
                        Look up its Q-value and add it to a list.
                    Select action a from this list with the highest Q-value.
            Else:
                Choose random action:
                    Choose action a randomly from the set of feasible actions A(s).
        Receive immediate reward R(s,a).
        Observe next state s'.
        Update the Q-value of state-action pair (s,a) according to equation (2).
        Set s = s'.

The application of reinforcement learning within activity-based models is not novel, since it has been implemented before by Charypar and Nagel (10), Janssens et al. (11) and Timmermans and Arentze (12). Yet, the application of this traditional reinforcement learning algorithm in this research area involves some limitations, as will be revealed in the following paragraph.

FUNCTION APPROXIMATION

Limitations of the reinforcement learning algorithm

First, the traditional reinforcement learning algorithm is not able to account efficiently for changes in the agent's environment, as this approach requires retraining the Q-function whenever changes occur. Yet, individuals do not operate in a static environment, as already pointed out in the problem description. Next, the traditional algorithm requires visiting all feasible state-action pairs at least once, and preferably an infinite number of times, to converge to the optimal policy. In addition, …
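To make the outline above concrete, the following minimal Python sketch implements the tabular Q-learning loop with the update of equation (2). It is an illustration only: the environment interface (random_state, feasible_actions, step, is_terminal), the default parameter values and the dictionary-based Q-table are assumptions introduced here for the sketch and do not describe the prototype presented in this paper.

import random
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=0.9, p_explore=0.1):
    # Look-up table mapping (state, action) pairs to Q-values; entries
    # default to 0.0 for pairs that have not been visited yet.
    # States and actions are assumed to be hashable.
    q = defaultdict(float)

    for _ in range(n_episodes):
        s = env.random_state()                 # select a random initial state s0
        while not env.is_terminal(s):          # repeat until the end of the episode
            actions = env.feasible_actions(s)  # feasible action set A(s)
            if random.random() < p_explore:
                a = random.choice(actions)     # exploration: random feasible action
            else:
                a = max(actions, key=lambda act: q[(s, act)])  # exploitation
            r, s_next = env.step(s, a)         # immediate reward R(s,a) and next state s'
            if env.is_terminal(s_next):
                target = r
            else:
                target = r + gamma * max(q[(s_next, a2)]
                                         for a2 in env.feasible_actions(s_next))
            # Equation (2): weighted average of the old estimate and the new target.
            q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
            s = s_next
    return q

The limitations discussed above largely stem from this dictionary-based look-up table, which has to hold one entry per feasible state-action combination. The regression-tree function approximator proposed in this research replaces such a table by a model that predicts Q-values from state and action attributes, which allows knowledge to be generalized to similar state-action pairs that have not yet been visited.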

REFERENCES

[1] Andrew W. Moore et al. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 1996.
[2] Leslie Pack Kaelbling et al. Practical Reinforcement Learning in Continuous Spaces. ICML, 2000.
[3] Theo Arentze et al. Modeling learning and adaptation processes in activity-travel choice: a framework and numerical experiment, 2003.
[4] F. Koppelman et al. History Dependency in Daily Activity Participation and Time Allocation for Commuters, 2002.
[5] Martin Dijst et al. Time windows in workers' activity patterns: empirical evidence from the Netherlands, 2003.
[6] Andrea S. Foulkes et al. Classification and Regression Trees, 2009.
[7] Jessica Y. Guo et al. A Comprehensive Econometric Micro-simulator for Daily Activity-Travel Patterns (CEMDAP), 2004.
[8] Sean T. Doherty. Should we abandon activity type analysis? Redefining activities by their salient attributes, 2006.
[9] Davy Janssens et al. Calibrating a New Reinforcement Learning Mechanism for Modeling Dynamic Activity-Travel Behavior and Key Events, 2007.
[10] Harry Timmermans et al. Modelling learning and adaptation in transportation contexts, 2005.
[11] Michael G. McNally et al. A Microsimulation of Daily Activity Patterns, 2000.
[12] Richard S. Sutton et al. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 2005.
[13] Theo Arentze et al. Experiences with developing ALBATROSS: a learning-based transportation oriented simulation system, 1998.
[14] K. Nagel et al. Generating complete all-day activity plans with genetic algorithms, 2005.
[15] Harry Timmermans et al. SMASH (Simulation Model of Activity Scheduling Heuristics): Some Simulations, 1996.
[16] Harry Timmermans et al. Estimating non-linear utility functions of time use in the context of an activity schedule adaptation model, 2003.
[17] Satoshi Fujii et al. Two Computational Process Models of Activity-Travel Behavior, 1997.
[18] John L. Bowman et al. The Day Activity Schedule Approach to Travel Demand Analysis, 1998.
[19] Theo Arentze et al. Implementation of a model of dynamic activity-travel rescheduling decisions: an agent-based micro-simulation framework, 2005.