Equivariant Similarity for Vision-Language Foundation Models