Automatic Understanding of the Visual World

One of the central problems of artificial intelligence is machine perception, i.e., the ability to understand the visual world based on input from sensors such as cameras. In this talk, I will present my team's recent progress in this direction. I will start by presenting results on how to generate additional training data using weak annotations, motion information, and synthetic data. Next, I will discuss our results for action recognition in videos, where human tubelets have proven successful. Our tubelet approach moves away from state-of-the-art frame-based approaches and improves classification and localization by relying on joint information from several frames. We show how to extend this type of method to weakly supervised learning of actions, which allows us to scale to large amounts of data with sparse manual annotation. Finally, I will present recent work on grasping with a robot arm, based on learning long-horizon manipulations with a hierarchy of RL- and imitation-based skills.