A Multi-level Alignment Training Scheme for Video-and-Language Grounding