Audio Visual Scene-Aware Dialog Track in DSTC8

Dialog systems need to understand scenes in order to converse with users about the objects and events in them. We introduced a new challenge task and dataset for Audio Visual Scene-Aware Dialog (AVSD) in DSTC7, which was the first attempt to combine conversation and multimodal video description into a single end-to-end differentiable network for building scene-aware dialog systems. The winning system of that challenge applied hierarchical attention mechanisms to combine text and visual information, yielding a 22% relative improvement in human ratings over the baseline system; a sketch of this fusion style appears below. The language models trained on the QA text contributed most to this improvement, while the benefits from visual models (such as object or event recognition from video) were limited. This suggests there is still considerable room to boost performance on the AVSD task by making better use of video features. To encourage such progress, we propose a second challenge for DSTC8 as a follow-up to the video-based scene-aware dialog track of DSTC7. The task is to generate or select a system response to a query that occurs during a dialog about a video. Participants will use the video, audio, and dialog text data to train end-to-end models. We will again use the AVSD datasets that we collected and used in DSTC7, and we may also include additional datasets, such as the How2 and Dense-Captioning datasets, to provide example dialogs with longer-term history dependence.
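The sketch below illustrates the general idea of hierarchical attention fusion for scene-aware dialog, not the winning system's exact architecture: a question encoding first attends over each modality's feature sequence (video frames, audio segments, dialog history), and a second attention then weights the resulting per-modality summaries before response generation. All module names, dimensions, and feature choices here are illustrative assumptions.

```python
# Minimal sketch of hierarchical attention fusion over multiple modalities.
# Assumed details (hidden sizes, feature extractors, layer choices) are not
# taken from the challenge systems.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttentionFusion(nn.Module):
    def __init__(self, query_dim, feature_dims, hidden_dim):
        super().__init__()
        # One projection per modality so features of different sizes share a space.
        self.proj = nn.ModuleList(nn.Linear(d, hidden_dim) for d in feature_dims)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        # Scoring layers for intra-modality and cross-modality attention.
        self.intra_score = nn.Linear(hidden_dim, 1)
        self.cross_score = nn.Linear(hidden_dim, 1)

    def forward(self, query, modality_features):
        # query: (batch, query_dim) encoding of the current question + history
        # modality_features: list of (batch, seq_len_m, feature_dim_m) tensors
        q = torch.tanh(self.query_proj(query)).unsqueeze(1)   # (batch, 1, hidden)
        summaries = []
        for proj, feats in zip(self.proj, modality_features):
            h = torch.tanh(proj(feats))                       # (batch, seq, hidden)
            # First level: attend over time steps within one modality.
            alpha = F.softmax(self.intra_score(h * q), dim=1)
            summaries.append((alpha * h).sum(dim=1))          # (batch, hidden)
        stacked = torch.stack(summaries, dim=1)               # (batch, n_modalities, hidden)
        # Second level: attend over the modality summaries themselves.
        beta = F.softmax(self.cross_score(stacked * q), dim=1)
        return (beta * stacked).sum(dim=1)                    # fused context for the decoder


# Example with dummy video, audio, and dialog-history features.
fusion = HierarchicalAttentionFusion(query_dim=256, feature_dims=[2048, 128, 512], hidden_dim=256)
question = torch.randn(4, 256)
video = torch.randn(4, 40, 2048)    # e.g. frame-level visual features
audio = torch.randn(4, 40, 128)     # e.g. audio segment features
history = torch.randn(4, 60, 512)   # encoded dialog history
context = fusion(question, [video, audio, history])
print(context.shape)  # torch.Size([4, 256])
```

The fused context vector would feed a response decoder (or a response-selection scorer), which is where the end-to-end training signal for the generation and retrieval variants of the task enters.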
