Exploring Correlation Between ROUGE and Human Evaluation on Meeting Summaries

Automatic evaluation is essential to the development of summarization systems. In text summarization, ROUGE has been shown to correlate well with human evaluation when measuring the match of content units. However, the multiparty meeting domain has many characteristics that may pose problems for ROUGE. The goal of this paper is to examine how well ROUGE scores correlate with human evaluation for extractive meeting summarization, and to explore meeting-specific factors that affect the correlation. This study extends the analysis in our previous work. Our experiments show that the correlation between ROUGE and human evaluation is generally not strong; however, when several meeting characteristics, such as disfluencies, speaker information, and stopwords, are taken into account in the ROUGE setting, better correlation can be achieved, especially on the system summaries. We also found that these factors affect human and system summaries differently. In addition, we contrast the results using ROUGE with those of other automatic summarization evaluation metrics, such as Kappa and Pyramid, and show the appropriateness of using ROUGE for this study.
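As a rough illustration of the kind of analysis described above, the following minimal sketch computes a simplified ROUGE-1 recall (clipped unigram overlap) with optional removal of stopwords and filler words, and then a rank correlation between metric scores and human scores. It is not the ROUGE package itself. The word lists, the function names, and the choice of Spearman rank correlation are assumptions made for illustration only and are not taken from the paper.

```python
from collections import Counter

# Hypothetical word lists, for illustration only (not from the paper).
STOPWORDS = frozenset({"the", "a", "of", "to", "and", "is", "we"})
FILLERS = frozenset({"uh", "um"})


def tokens(text, drop=frozenset()):
    """Lowercase, split on whitespace, and discard words in `drop`."""
    return [w for w in text.lower().split() if w not in drop]


def rouge1_recall(candidate, reference, drop=frozenset()):
    """Simplified ROUGE-1 recall: clipped unigram overlap / reference length."""
    cand = Counter(tokens(candidate, drop))
    ref = Counter(tokens(reference, drop))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / total


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Under these assumptions, one would score each summary with `rouge1_recall` under different settings (for example `drop=STOPWORDS | FILLERS` versus no filtering) and compare `spearman(rouge_scores, human_scores)` across settings to see which configuration tracks human judgments more closely.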
