Video-Text Representation Learning via Differentiable Weak Temporal Alignment