Learning from feedback on actions past and intended

Robotic learning promises to eventually provide great societal benefits. In contrast to pure trial-and-error learning, human instruction has at least two benefits: (1) Human teaching can lead to much faster learning. For instance, a human can model the delayed outcome of a behavior and give feedback immediately, unambiguously informing the robot of the quality of its recent action. (2) Human instruction can serve to define a task objective, empowering end-users who lack programming skills to customize behavior.

The tamer framework [3, 2] was developed to provide a learning mechanism for a specific, psychologically grounded [1] form of teaching: signals of reward and punishment. tamer breaks the process of interactively learning behaviors from live human reward into three modules (sketched in code below): credit assignment, in which delayed human reward is attributed appropriately to recent events; regression over experienced events and the reward credited to them, yielding a predictive model of future human reward; and action selection using that model of human reward.

tamer differs in multiple ways from traditional reinforcement learning (RL) algorithms, which are generally powerful and an intuitive choice for this problem but ultimately ill-suited to learning from human reward. For instance, human reward is stochastically delayed from the event that prompted it; tamer models and adjusts for this delay, whereas traditional RL algorithms do not. More importantly, human trainers consider the long-term effects of actions, making each reward a complete judgment of the quality of recent actions. Predictions of near-term human reward are therefore analogous to estimates of expected long-term reward in RL, simplifying action selection to choosing the action with the highest predicted human reward. On multiple tasks, tamer agents have been shown to learn more quickly, sometimes dramatically so, than counterparts that learn from a predefined evaluation function instead of human interaction. Further, the tamer framework
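To make the three modules concrete, the following is a minimal sketch of a tamer-style learner, written under simplifying assumptions not taken from the cited papers: a discrete action set, a linear model over state-action features, and uniform credit assignment over a fixed recent time window standing in for tamer's probabilistic model of feedback delay. All names (`TamerSketch`, `featurize`, and so on) are illustrative, not the authors' implementation.

```python
from collections import deque

class TamerSketch:
    """Minimal TAMER-style learner (illustrative sketch, not the published code).

    Learns a linear model of near-term human reward over state-action
    features and selects actions greedily with respect to that model.
    """

    def __init__(self, n_features, actions, featurize, window=0.8, lr=0.1):
        self.w = [0.0] * n_features   # weights of the human-reward model
        self.actions = actions        # discrete action set
        self.featurize = featurize    # (state, action) -> feature list
        self.window = window          # seconds over which reward is credited
        self.lr = lr                  # regression step size
        self.history = deque()        # recent (timestamp, features) events

    def predict(self, feats):
        """Predicted human reward for one state-action feature vector."""
        return sum(wi * xi for wi, xi in zip(self.w, feats))

    def act(self, state, t):
        """Action selection: greedily maximize predicted human reward.

        Because each human reward is treated as a complete judgment of
        recent behavior, no long-term return needs to be computed.
        """
        best = max(self.actions,
                   key=lambda a: self.predict(self.featurize(state, a)))
        self.history.append((t, self.featurize(state, best)))
        return best

    def reward(self, h, t):
        """Credit assignment plus regression for one human reward signal h.

        Uniform credit over a fixed window is a stand-in for TAMER's
        probabilistic model of how human feedback is delayed.
        """
        while self.history and t - self.history[0][0] > self.window:
            self.history.popleft()    # too old to be credited
        if not self.history:
            return
        credit = 1.0 / len(self.history)
        for _, feats in self.history:
            error = h - self.predict(feats)
            # Credit-weighted gradient step toward the human reward label.
            self.w = [wi + self.lr * credit * error * xi
                      for wi, xi in zip(self.w, feats)]

# Toy interaction loop: two actions with one-hot features; a stub
# "trainer" rewards action 1 and punishes action 0, with a 0.3 s delay.
agent = TamerSketch(
    n_features=2, actions=[0, 1],
    featurize=lambda s, a: [1.0 if a == 0 else 0.0,
                            1.0 if a == 1 else 0.0])
for step in range(20):
    t = float(step)
    a = agent.act(state=None, t=t)
    agent.reward(h=1.0 if a == 1 else -1.0, t=t + 0.3)
```

A full tamer agent would replace the uniform window with an explicit delay distribution and the linear model with whatever regressor suits the task; the greedy argmax in `act` reflects the point above that predicted human reward already encodes long-term quality.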