JOINTLY LEARNING "WHAT" AND "HOW" FROM INSTRUCTIONS AND GOAL-STATES

Training agents to follow instructions requires some way of rewarding them for behavior which accomplishes the intent of the instruction. For non-trivial instructions, which may be underspecified or ambiguous, it can be difficult or impossible to specify a reward function or to obtain suitable expert trajectories for the agent to imitate. For these scenarios, we introduce a method which requires only pairs of instructions and examples of goal states, from which we jointly learn a model of the instruction-conditional reward and a policy which executes instructions. Our experiments in a gridworld compare the effectiveness of our method with that of RL in a control setting where ground-truth reward is available. We furthermore evaluate how our approach generalizes to unseen instructions, and to scenarios where the environment dynamics change outside of training, requiring fine-tuning of the policy "in the wild".
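To make the joint learning idea above concrete, the following is a minimal sketch, not the paper's implementation: a reward model is trained as a binary classifier that treats provided goal-state examples as positives and states reached by the current policy as negatives, while the policy is updated with REINFORCE against the learned reward. All module names, network sizes, the placeholder instruction embeddings, and the toy dynamics `toy_step` are hypothetical assumptions for illustration only.

```python
# Hedged sketch of jointly learning an instruction-conditional reward model
# and a policy from (instruction, goal-state) pairs. Hypothetical details only.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, INSTR_DIM, N_ACTIONS, HORIZON = 16, 8, 4, 10

class RewardModel(nn.Module):
    """Scores how well a state satisfies an instruction (binary classifier)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + INSTR_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, 1))
    def forward(self, state, instr):
        return self.net(torch.cat([state, instr], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Maps (state, instruction) to a distribution over actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + INSTR_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, N_ACTIONS))
    def forward(self, state, instr):
        return torch.distributions.Categorical(
            logits=self.net(torch.cat([state, instr], dim=-1)))

def rollout(policy, env_step, instr):
    """Run the policy for one episode; return log-probs and the final state."""
    state = torch.zeros(STATE_DIM)
    log_probs = []
    for _ in range(HORIZON):
        dist = policy(state, instr)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state = env_step(state, action)  # hypothetical environment dynamics
    return torch.stack(log_probs), state

def train_step(reward_model, policy, r_opt, p_opt, instr, goal_state, env_step):
    # 1) Collect a trajectory for this instruction with the current policy.
    log_probs, final_state = rollout(policy, env_step, instr)

    # 2) Reward model: goal-state examples are positives, the states the
    #    current policy reaches are treated as negatives.
    pos = reward_model(goal_state, instr)
    neg = reward_model(final_state.detach(), instr)
    r_loss = F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) + \
             F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg))
    r_opt.zero_grad(); r_loss.backward(); r_opt.step()

    # 3) Policy: REINFORCE against the learned (frozen) reward on the final state.
    with torch.no_grad():
        r = torch.sigmoid(reward_model(final_state, instr))
    p_loss = -(log_probs.sum() * r)
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()
    return r_loss.item(), p_loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    reward_model, policy = RewardModel(), Policy()
    r_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
    p_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    toy_step = lambda s, a: s + 0.1 * torch.randn(STATE_DIM)  # stand-in dynamics
    for _ in range(5):
        instr = torch.randn(INSTR_DIM)   # placeholder instruction embedding
        goal = torch.randn(STATE_DIM)    # placeholder goal-state example
        print(train_step(reward_model, policy, r_opt, p_opt, instr, goal, toy_step))
```

Under these assumptions, the two objectives are interleaved: as the policy improves, the negatives it supplies become harder, sharpening the reward model, which in turn provides a more informative learning signal to the policy.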