More Diverse Dialogue Datasets via Diversity-Informed Data Collection

Automated generation of conversational dialogue using modern neural architectures has made notable advances. However, these models often produce uninteresting, predictable responses, a drawback known as the diversity problem. We introduce a new strategy to address this problem, called Diversity-Informed Data Collection. Unlike prior approaches, which modify model architectures, this method uses dynamically computed corpus-level statistics to determine which conversational participants to collect data from. Diversity-Informed Data Collection produces significantly more diverse data than baseline data collection methods and yields better results on two downstream tasks: emotion classification and dialogue generation. The method is generalizable and can be used with other corpus-level metrics.
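The selection idea can be illustrated with a minimal sketch. It is not the paper's implementation: the metric (a distinct-1 ratio of unique to total tokens), the greedy selection loop, and all function names are assumptions chosen only for illustration. The sketch recomputes a corpus-level statistic after each round and prefers the participant whose utterances raise it the most.

```python
def distinct_1(utterances):
    """Assumed corpus-level metric: unique tokens / total tokens (0.0 if empty)."""
    tokens = [tok for utt in utterances for tok in utt.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def rank_participants(corpus, contributions, metric=distinct_1):
    """Order participant ids by how much their utterances raise the metric."""
    base = metric(corpus)
    gains = {
        pid: metric(corpus + utts) - base
        for pid, utts in contributions.items()
    }
    return sorted(gains, key=gains.get, reverse=True)


def collect(corpus, contributions, keep, metric=distinct_1):
    """Hypothetical greedy loop: re-rank after every round, since the
    corpus-level statistic changes whenever new data is added."""
    corpus = list(corpus)            # copy so the caller's list is untouched
    remaining = dict(contributions)
    chosen = []
    for _ in range(min(keep, len(remaining))):
        best = rank_participants(corpus, remaining, metric)[0]
        corpus.extend(remaining.pop(best))
        chosen.append(best)
    return chosen, corpus


if __name__ == "__main__":
    seed = ["i am fine", "i am good"]
    pool = {
        "a": ["i am fine"],                  # repeats existing tokens
        "b": ["the weather turned stormy"],  # adds only new tokens
    }
    print(collect(seed, pool, keep=1))
```

Because `metric` is a parameter, any other corpus-level statistic can be substituted for the assumed distinct-1 ratio without changing the selection loop.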
