In many decision-making settings, reward is acquired in response to some complex behaviour that an agent realizes over time. An autonomous taxi may receive reward for picking up a passenger and subsequently delivering them to their destination. An assistive robot may receive reward for ensuring that a person in its care takes their medication once daily, soon after eating. Such reward is acquired by an agent in response to following a path: a sequence of states that collectively capture the reward-worthy behaviour. Reward of this sort is referred to as non-Markovian reward because it is predicated on state history rather than on the current state. Our concern in this paper is with both the specification and the effective exploitation of non-Markovian reward in the context of Markov Decision Processes (MDPs). State-of-the-art UCT-based planners struggle with non-Markovian rewards because of their weak guidance and relatively myopic lookahead. Here we specify non-Markovian reward-worthy behaviour in Linear Temporal Logic. We translate these behaviours to corresponding deterministic finite state automata whose accepting conditions signify satisfaction of the reward-worthy behaviour. The accepting conditions of these automata form the basis of Markovian rewards, yielding reward-augmented MDPs that can be solved by off-the-shelf MDP planners while crucially preserving policy optimality guarantees. We then explore the use of reward shaping to automatically transform these automata-based rewards into reshaped rewards that better guide search. We augmented benchmark MDP domains with non-Markovian rewards and evaluated our technique using PROST, a state-of-the-art heuristic- and UCT-based MDP planner. Our experiments demonstrate that exploiting these techniques yields significantly improved performance. The work presented here uses Linear Temporal Logic to specify non-Markovian reward, but our approach applies to any formal language for which there is a corresponding automaton representation.
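To make the construction described above concrete, the following minimal Python sketch (ours, not the authors' implementation; names such as ToyDFA, markovian_reward, and shaped_reward are illustrative) tracks a hand-built DFA for a hypothetical "pick up, then deliver" behaviour alongside an MDP trace, pays a Markovian reward on the augmented state when the DFA accepts, and adds a potential-based shaping term F(q, q') = gamma * Phi(q') - Phi(q), which, following Ng et al. (1999), leaves the optimal policy unchanged.

# Sketch only: a DFA-tracked Markovian reward with potential-based shaping.
# All identifiers are illustrative assumptions, not taken from the paper.

GAMMA = 0.95

class ToyDFA:
    """DFA for the hypothetical behaviour 'pick up, then deliver'."""
    def __init__(self):
        self.state = 0                  # 0: start, 1: picked up, 2: delivered (accepting)
        self.accepting = {2}

    def step(self, labels):
        """Advance on the set of propositions true in the current MDP state."""
        if self.state == 0 and "picked_up" in labels:
            self.state = 1
        elif self.state == 1 and "delivered" in labels:
            self.state = 2
        return self.state

def markovian_reward(q, dfa):
    """Reward is now a function of the augmented (MDP state, DFA state) pair."""
    return 1.0 if q in dfa.accepting else 0.0

# Potential function over DFA states: Phi estimates progress toward acceptance.
PHI = {0: 0.0, 1: 0.5, 2: 1.0}

def shaped_reward(q, q_next, dfa):
    """Base Markovian reward plus the policy-invariant shaping term."""
    return markovian_reward(q_next, dfa) + GAMMA * PHI[q_next] - PHI[q]

# Example trace: the agent picks up a passenger and later delivers them.
dfa = ToyDFA()
trace = [set(), {"picked_up"}, set(), {"delivered"}]
q, total = dfa.state, 0.0
for labels in trace:
    q_next = dfa.step(labels)
    total += shaped_reward(q, q_next, dfa)
    q = q_next
print(f"accepting reached: {q in dfa.accepting}, shaped return: {total:.2f}")

The shaping term rewards intermediate progress through the automaton (here, the pick-up step), which is how reshaped rewards can guide an otherwise myopic UCT search toward satisfying the full behaviour.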