Joint Speech-Text Embeddings with Disentangled Speaker Features