Learning to Answer Questions in Dynamic Audio-Visual Scenarios