Paper Information - A Large-Scale Corpus of E-mail Conversations with Standard and Two-Level Dialogue Act Annotations

We present a large-scale corpus of e-mail conversations with domain-agnostic and two-level dialogue act (DA) annotations, aimed at a better understanding of asynchronous conversations. We annotate over 6,000 messages and 35,000 sentences from more than 2,000 threads. To obtain domain-independent and application-independent DA annotations, we adopt the ISO 24617-2 standard as the annotation scheme. To assess the difficulty of DA recognition on our corpus, we evaluate several models, including a pre-trained contextual representation model, as baselines. The experimental results show that BERT outperforms the other neural network models, including previous state-of-the-art models, but falls short of human performance. We also demonstrate that DA tags of two-level granularity enable a DA recognition model to learn efficiently through multi-task learning. An evaluation of a model trained on our corpus on other domains of asynchronous conversation demonstrates the domain independence of our DA annotations.

[1] John J. Godfrey, et al. SWITCHBOARD: telephone speech corpus for research and development, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Mona T. Diab, et al. Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data, 2019, EMNLP.

[3] Shafiq R. Joty, et al. Speech Act Modeling of Written Asynchronous Conversations with Task-Specific Embeddings and Conditional Structured Models, 2016, ACL.

[4] G. Carenini, et al. A Publicly Available Annotated Corpus for Supervised Email Summarization, 2008.

[5] Yiming Yang, et al. The Enron Corpus: A New Dataset for Email Classification Research, 2004.

[6] Jihie Kim, et al. Learning to Detect Conversation Focus of Threaded Discussions, 2006, NAACL.

[7] Mark G. Core, et al. Coding Dialogs with the DAMSL Annotation Scheme, 1997.

[8] Kôiti Hasida, et al. ISO 24617-2: A semantically-based standard for dialogue annotation, 2012, LREC.

[9] Elizabeth Shriberg, et al. Meeting Recorder Project: Dialog Act Labeling Guide, 2004.

[10] Elizabeth Shriberg, et al. The ICSI Meeting Recorder Dialog Act (MRDA) Corpus, 2004, SIGDIAL Workshop.

[11] Prasenjit Mitra, et al. Summarizing Online Forum Discussions – Can Dialog Acts of Individual Messages Help?, 2014, EMNLP.

[12] Shay B. Cohen, et al. Conversation Trees: A Grammar Model for Topic Structure in Forums, 2015, EMNLP.

[13] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.