Benchmarking Robustness under Distribution Shift of Multimodal Image-Text Models