Do Vision and Language Encoders Represent the World Similarly?