Overview of the seventh Dialog System Technology Challenge: DSTC7

Abstract This paper provides detailed information about the seventh Dialog System Technology Challenge (DSTC7) and its three tracks, which aim to explore the problem of building robust and accurate end-to-end dialog systems. In more detail, DSTC7 focuses on developing and exploring end-to-end technologies for three pragmatic challenges: (1) sentence selection for multiple domains, (2) generation of informational responses grounded in external knowledge, and (3) audio-visual scene-aware dialog, which allows conversations with users about objects and events around them. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the tracks, the datasets and annotations provided, an overview of the submitted systems, and their final results. For Track 1, LSTM-based models performed best across both datasets, effectively handling the task variants in which no correct answer was present or multiple paraphrases were included. For Track 2, the best results were obtained by RNN-based architectures augmented to incorporate facts through two types of encoders, a dialog encoder and a fact encoder, combined with attention mechanisms and a pointer-generator approach. Finally, for Track 3, the best model used hierarchical attention mechanisms to combine text and vision information, obtaining a human rating score 22% higher than that of the baseline LSTM system. More than 220 participants registered and about 40 teams took part in the final challenge. Thirty-two scientific papers reporting systems submitted to DSTC7, and three general technical papers on dialog technologies, were presented during the one-day wrap-up workshop at AAAI-19. During the workshop, we reviewed the state-of-the-art systems, shared novel approaches to the DSTC7 tasks, and discussed future directions for the challenge (DSTC8).
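
To make the Track 2 description above concrete, the following is a minimal, illustrative sketch of the kind of architecture summarized there: an RNN generator with separate dialog and fact encoders, attention over both, and a pointer-generator mixture that can copy tokens from the facts. This is not the participants' code; PyTorch, the GRU cells, all class names, hyperparameters, and the helper method are assumptions made purely for illustration.

```python
# Hypothetical sketch of a fact-grounded response generator in the spirit of
# the Track 2 summary (dialog encoder + fact encoder + attention + pointer-
# generator). Sizes, names, and the copy mechanism details are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedResponder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.dialog_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)  # encodes dialog history
        self.fact_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)    # encodes external facts
        self.decoder = nn.GRUCell(emb_dim + 2 * hid_dim, hid_dim)
        self.attn_dialog = nn.Linear(hid_dim, hid_dim, bias=False)
        self.attn_fact = nn.Linear(hid_dim, hid_dim, bias=False)
        self.out = nn.Linear(3 * hid_dim, vocab_size)
        self.p_gen = nn.Linear(3 * hid_dim, 1)                        # generate-vs-copy gate

    def _attend(self, query, keys, proj):
        # Simple bilinear attention over encoder states (a simplification).
        scores = torch.bmm(proj(keys), query.unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)
        return context, weights

    def forward(self, dialog_ids, fact_ids, prev_token, dec_state):
        # One decoding step: attend over dialog and fact encodings, then mix
        # the generation distribution with a copy distribution over fact tokens.
        d_states, _ = self.dialog_enc(self.embed(dialog_ids))
        f_states, _ = self.fact_enc(self.embed(fact_ids))
        d_ctx, _ = self._attend(dec_state, d_states, self.attn_dialog)
        f_ctx, f_weights = self._attend(dec_state, f_states, self.attn_fact)
        dec_in = torch.cat([self.embed(prev_token), d_ctx, f_ctx], dim=-1)
        dec_state = self.decoder(dec_in, dec_state)
        features = torch.cat([dec_state, d_ctx, f_ctx], dim=-1)
        gen_dist = F.softmax(self.out(features), dim=-1)
        p = torch.sigmoid(self.p_gen(features))
        # Pointer-generator mixture: scatter attention mass onto fact token ids.
        copy_dist = torch.zeros_like(gen_dist).scatter_add_(1, fact_ids, f_weights)
        return p * gen_dist + (1 - p) * copy_dist, dec_state
```

The design choice worth noting is the separation of encoders: keeping the dialog history and the grounding facts in distinct attention spaces lets the decoder weight conversational context and external knowledge independently, while the copy gate lets rare or out-of-vocabulary content from the facts surface verbatim in the response.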
