Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective