Contrastive Learning of Semantic and Visual Representations for Text Tracking