A robust and interpretable deep learning framework for multi-modal registration via keypoints