Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition