Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data Scenarios