XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning