A Large-Scale Chinese Short-Text Conversation Dataset

Recent advances in neural dialogue generation models have shown promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data-cleaning pipeline, built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at this https URL.
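The abstract describes the cleaning pipeline only at a high level (a set of rules plus a trained classifier). As an illustrative sketch of what a rule-based first pass over (post, response) pairs might look like, here is a minimal Python filter; the specific rules, thresholds, and keyword list below are assumptions for illustration, not the paper's actual pipeline:

```python
import re

# Hypothetical rule-based first pass of a dialogue-cleaning pipeline.
# The real LCCC rules and thresholds are not given in the abstract;
# every value below is an illustrative assumption.

URL_PATTERN = re.compile(r"https?://\S+")
BLACKLIST = {"广告", "加微信"}  # example ad/spam keywords (assumed)

def keep_pair(post: str, response: str,
              min_len: int = 2, max_len: int = 100) -> bool:
    """Return True if the (post, response) pair passes all rules."""
    for utterance in (post, response):
        # Rule 1: drop utterances containing URLs.
        if URL_PATTERN.search(utterance):
            return False
        # Rule 2: enforce length bounds (characters, for Chinese text).
        if not (min_len <= len(utterance) <= max_len):
            return False
        # Rule 3: drop utterances made of a single repeated character.
        if len(set(utterance)) <= 1:
            return False
        # Rule 4: drop utterances containing blacklisted keywords.
        if any(word in utterance for word in BLACKLIST):
            return False
    return True

pairs = [
    ("你好", "你好，很高兴认识你"),
    ("看这个 http://spam.example", "好的"),
    ("哈哈哈哈哈", "嗯"),
]
cleaned = [p for p in pairs if keep_pair(*p)]
```

In a full pipeline, pairs surviving such rules would then be scored by the learned classifier, so the cheap rules prune obvious noise before the more expensive model runs.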
