Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent’s latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, and unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model which contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control.
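
To make the idea of contrastively predicting the next observation concrete, the following is a minimal sketch of an InfoNCE-style loss, a common choice for contrastive prediction. It is an illustration under stated assumptions, not the implementation used in this work: the function name `info_nce_loss`, the dot-product similarity, and the `temperature` parameter are all hypothetical, and the recurrent dynamics model and image encoder that would produce the two sets of embeddings are omitted.

```python
import math


def info_nce_loss(predicted, targets, temperature=1.0):
    """InfoNCE-style loss over a batch of embedding pairs.

    predicted[i] is assumed to be the latent model's prediction for the
    next observation, and targets[i] the embedding of the observation that
    actually followed. The other rows of `targets` serve as negatives.
    (Hypothetical helper for illustration only.)
    """
    n = len(predicted)
    total = 0.0
    for i in range(n):
        # Dot-product similarity between prediction i and every candidate.
        logits = [
            sum(p * t for p, t in zip(predicted[i], targets[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Cross-entropy with the matching pair (index i) as the correct class.
        total += log_z - logits[i]
    return total / n


# Toy batch: predictions that line up with their targets give a lower loss
# than the same predictions paired with swapped targets.
aligned = [[5.0, 0.0], [0.0, 5.0]]
swapped = [[0.0, 5.0], [5.0, 0.0]]
loss_aligned = info_nce_loss(aligned, aligned)
loss_swapped = info_nce_loss(aligned, swapped)
```

In this sketch the loss only rewards a prediction for being closer to its own next-observation embedding than to the other embeddings in the batch, so details that do not help tell observations apart need not be modeled.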

