Cross-modal Contrastive Learning for Speech Translation