Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning