Exposing the Limits of Video-Text Models through Contrast Sets