PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation

To explore the limits of dialogue generation pre-training, we present PLATO-XL, a family of models with up to 11 billion parameters trained on both Chinese and English social media conversations. To train models at this scale, we adopt the unified transformer architecture for its high computation and parameter efficiency. In addition, we carry out multi-party aware pre-training to better distinguish the characteristic information in social media conversations. With these designs, PLATO-XL achieves superior performance compared with other approaches in both Chinese and English chitchat. We further explore the capacity of PLATO-XL on other conversational tasks, such as knowledge-grounded dialogue and task-oriented conversation. The experimental results indicate that PLATO-XL obtains state-of-the-art results across multiple conversational tasks, verifying its potential as a foundation model of conversational AI.
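The two core ideas in the abstract, a unified transformer shared between context understanding and response generation, and multi-party aware pre-training that tracks who said what, can be illustrated with a small sketch. The PyTorch snippet below is not the official PLATO-XL implementation; the class name, hyper-parameters, and the concrete realization of multi-party awareness as learned role embeddings combined with a mixed attention mask (bidirectional over the context, causal over the response) are assumptions made for illustration only.

```python
# Minimal sketch (not the official PLATO-XL code): one shared network handles
# bidirectional context encoding and left-to-right response generation via a
# mixed attention mask; role embeddings mark the speaker of every token.
import torch
import torch.nn as nn


class UnifiedTransformerSketch(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, n_heads=4,
                 n_layers=2, n_roles=8, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.role_emb = nn.Embedding(n_roles, d_model)   # speaker identity
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    @staticmethod
    def mixed_mask(ctx_len, resp_len):
        # True = blocked. Context tokens attend bidirectionally within the
        # context; response tokens attend to the full context plus earlier
        # response tokens (causal), so one network serves both roles.
        total = ctx_len + resp_len
        mask = torch.zeros(total, total, dtype=torch.bool)
        mask[:ctx_len, ctx_len:] = True                  # context cannot see response
        causal = torch.triu(torch.ones(resp_len, resp_len), diagonal=1).bool()
        mask[ctx_len:, ctx_len:] = causal                # response is causal
        return mask

    def forward(self, tokens, roles, ctx_len):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok_emb(tokens) + self.role_emb(roles) + self.pos_emb(pos)
        mask = self.mixed_mask(ctx_len, tokens.size(1) - ctx_len).to(tokens.device)
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)


# Toy usage: a 6-token context from two speakers followed by a 4-token reply.
model = UnifiedTransformerSketch()
tokens = torch.randint(0, 8000, (1, 10))
roles = torch.tensor([[0, 0, 0, 1, 1, 1, 0, 0, 0, 0]])  # speaker id per token
logits = model(tokens, roles, ctx_len=6)
print(logits.shape)  # torch.Size([1, 10, 8000])
```

The parameter efficiency mentioned in the abstract comes from this sharing: generation and understanding reuse the same transformer blocks, with only the attention mask and the lightweight role/position embeddings distinguishing the two modes.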