Towards Quantifiable Dialogue Coherence Evaluation

Automatic dialogue coherence evaluation has attracted increasing attention and is crucial for developing promising dialogue systems. However, existing metrics have two major limitations: (a) they are mostly trained in a simplified two-level setting (coherent vs. incoherent), whereas humans give Likert-type, multi-level coherence scores, dubbed “quantifiable”; (b) their predicted coherence scores cannot align with actual human rating standards due to the absence of human guidance during training. To address these limitations, we propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel framework that aims to train a quantifiable dialogue coherence metric reflecting actual human rating standards. Specifically, QuantiDCE consists of two training stages: Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is proposed to enable the model to learn a coarse judgement of coherence degrees. During KD fine-tuning, the pre-trained model is further fine-tuned to learn actual human rating standards from only a very small amount of human-annotated data. To preserve generalizability even with such limited fine-tuning data, a novel KD regularization is introduced to retain the knowledge learned at the pre-training stage. Experimental results show that the metric trained with QuantiDCE correlates more strongly with human judgements than other state-of-the-art metrics.
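
The abstract does not spell out the MLR loss, but a minimal sketch of a multi-level ranking objective, assuming predicted coherence scores s^(i) for responses annotated at level i (higher i = more coherent) and a fixed margin delta, is:

    \mathcal{L}_{\mathrm{MLR}} = \sum_{i < j} \max\!\big(0,\ \delta + s^{(i)} - s^{(j)}\big)

That is, every response at a higher coherence level must outscore every response at a lower level by at least the margin delta. The paper's actual MLR loss may include additional terms (e.g., intra-level compactness), so this form is illustrative only.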

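To make the KD fine-tuning stage concrete, below is a hedged PyTorch sketch. The model interface (a network mapping an encoded context-response pair to a scalar score), the MSE rating loss, the output-level distillation term, and the kd_weight parameter are all illustrative assumptions, not the paper's released implementation.

    import copy

    import torch
    import torch.nn.functional as F

    def make_teacher(pretrained_metric):
        # Freeze a snapshot of the MLR pre-trained metric; it serves as the
        # teacher whose outputs regularize the fine-tuned student.
        teacher = copy.deepcopy(pretrained_metric).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)
        return teacher

    def kd_finetune_step(student, teacher, inputs, human_scores,
                         optimizer, kd_weight=1.0):
        # One gradient step on a small batch of human-annotated scores.
        student.train()
        pred = student(inputs)              # student's predicted scores
        with torch.no_grad():
            teacher_pred = teacher(inputs)  # pre-training knowledge

        # Fit the actual human rating standards (e.g., Likert scores).
        rating_loss = F.mse_loss(pred, human_scores)
        # KD regularization: keep the student close to the pre-trained
        # teacher so that the very few annotated examples do not erase
        # the coherence ranking learned during MLR pre-training.
        kd_loss = F.mse_loss(pred, teacher_pred)

        loss = rating_loss + kd_weight * kd_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In this sketch the distillation target is the teacher's scalar output; distilling intermediate features instead would be an equally plausible reading of "retaining the knowledge learned at the pre-training stage."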