Learning Visually Grounded and Multilingual Representations