A multi-stage dynamical fusion network for multimodal emotion recognition