Cross-Modal Learning for Audio-Visual Emotion Recognition in Acted Speech