Disentangled Representation Learning for Text-Video Retrieval