Learning a shared embedding representation of motion and text using contrastive learning
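
The line above names the technique but not the objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss over a batch of paired motion and text embeddings. The function names, the temperature value of 0.07, and the assumption that row i of each array belongs to the same motion/text pair are all illustrative and not taken from the source.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    # Hypothetical helper: motion_emb and text_emb are (batch, dim) arrays
    # produced by two separate encoders; row i of each is a matching pair.
    m = l2_normalize(motion_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(m))                 # matching pairs lie on the diagonal
    loss_m2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # motion -> text
    loss_t2m = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> motion
    return 0.5 * (loss_m2t + loss_t2m)
```

Minimizing a loss of this shape pulls each motion embedding toward its paired text embedding and pushes it away from the other texts in the batch, which is what places both modalities in one shared space. In practice the two encoders would be trained with an autodiff framework; NumPy is used here only to keep the arithmetic visible.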