Multimodal emotion recognition based on audio and text by using hybrid attention networks