MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment