Master 2 Internship Proposal: Multimodal Vision-Language Pretraining