Disentangled Speech Embeddings Using Cross-Modal Self-Supervision