Multimodal Speech Recognition for Language-Guided Embodied Agents