VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval