CSTNews - A Discourse-Annotated Corpus for Single and Multi-Document Summarization of News Texts in Brazilian Portuguese

Summary. This paper introduces CSTNews, a discourse-annotated corpus intended to foster research on single- and multi-document summarization. The corpus comprises 50 clusters of news texts in Brazilian Portuguese together with related material, including a set of single-document manual summaries and a set of multi-document manual and automatic summaries. The texts are annotated for discourse organization in different ways, following both Rhetorical Structure Theory (RST) and Cross-document Structure Theory (CST). The corpus was produced within the SUCINTO Project, which investigates summarization strategies and develops tools and resources for that purpose. The paper details the design of the discourse annotation tasks and the decisions taken during the annotation process.
