The First Conversational Intelligence Challenge

The first Conversational Intelligence Challenge was conducted over 2017, with the finals held at the NIPS conference. The challenge was aimed at evaluating the state of the art in non-goal-driven dialogue systems (chatbots) and at collecting a large dataset of human-to-machine and human-to-human conversations manually labelled for quality. We established a task for formal human evaluation of chatbots that tests their capabilities in topic-oriented dialogue. Instead of traditional chit-chat, participating systems and humans were asked to discuss a short text. Ten dialogue systems participated in the competition. The majority of them combined multiple conversational models, such as question-answering and chit-chat systems, to make conversations more natural. The evaluation of the chatbots was performed by human assessors. Almost 1,000 volunteers took part, and over 4,000 dialogues were collected during the competition. The final dialogue quality score for the best bot was 2.7, compared to 3.8 for humans. This demonstrates that current technology can support dialogue on a given topic, but with quality significantly lower than that of humans. To close this gap, we plan to continue the experiments by organising the next conversational intelligence competition. This future work will benefit from the data we collected and the dialogue systems that were made available after the competition described in this paper.
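To make the reported scores concrete, the sketch below shows one way per-dialogue quality ratings could be aggregated into the per-system averages quoted above. This is a minimal illustration only: the field names, the 1-5 rating scale, and the record layout are assumptions for the example, not the actual ConvAI dataset schema or evaluation pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-dialogue records: each dialogue carries the identifier of
# the system that produced it and a quality rating from a human assessor.
# (Field names and the rating scale are assumptions, not the ConvAI schema.)
dialogues = [
    {"system": "bot_A", "quality": 3},
    {"system": "bot_A", "quality": 2},
    {"system": "human", "quality": 4},
    {"system": "human", "quality": 4},
]

def average_quality(records):
    """Group dialogue ratings by system and return the mean rating per system."""
    scores = defaultdict(list)
    for record in records:
        scores[record["system"]].append(record["quality"])
    return {system: mean(ratings) for system, ratings in scores.items()}

print(average_quality(dialogues))
# e.g. {'bot_A': 2.5, 'human': 4.0} -- the competition reports 2.7 for the
# best bot versus 3.8 for humans on this kind of aggregate quality score.
```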
