Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning