Verbal Focus-of-Attention System for Learning-from-Observation

The learning-from-observation (LfO) framework aims to map human demonstrations to a robot to reduce programming effort. To this end, an LfO system encodes a human demonstration into a series of execution units for a robot, which are referred to as task models. Although previous research has proposed successful task-model encoders, there has been little discussion on how to guide a task-model encoder in a scene with spatio-temporal noises, such as cluttered objects or unrelated human body movements. Inspired by the function of verbal instructions guiding an observer’s visual attention, we propose a verbal focus-of-attention (FoA) system (i.e., spatiotemporal filters) to guide a task-model encoder. For object manipulation, the system first recognizes the name of a target object and its attributes from verbal instructions. The information serves as a where-to-look FoA filter to confine the areas in which the target object existed in the demonstration. The system then detects the timings of grasp and release that occurred in the filtered areas. The timings serve as a when-to-look FoA filter to confine the period of object manipulation. Finally, a task-model encoder recognizes the task models by employing the FoA filters. We demonstrate the robustness of the verbal FoA in attenuating spatio-temporal noises by comparing it with an existing action localization network. The contributions of this study are as follows: (1) to propose a verbal FoA for LfO, (2) to design an algorithm to calculate FoA filters from verbal input, and (3) to demonstrate the effectiveness of a verbal FoA in localizing an action by comparing it with a state-of-the-art vision system.