Learning Audio-Video Modalities from Image Captions