Cross-modal retrieval of remote sensing images and text based on self-attention unsupervised deep common feature space