Collecting Interactive Multi-modal Datasets for Grounded Language Understanding