Grounding Spatio-Temporal Language with Transformers