DiSeg 1.0: The first system for Spanish discourse segmentation

Discourse parsing is currently a prominent research topic, yet no discourse parser exists for Spanish texts. The first stage in developing such a tool is discourse segmentation. In this work, we present DiSeg, the first discourse segmenter for Spanish, which adopts the framework of Rhetorical Structure Theory and is based on lexical and syntactic rules. We describe the system and evaluate its performance against a gold standard corpus divided into a medical subcorpus and a terminological subcorpus. The results are promising and suggest that discourse segmentation is feasible using shallow parsing.
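As an illustration of what evaluating a segmenter against a gold standard can involve, the sketch below scores predicted segment boundaries with precision, recall and F1. The abstract does not state which metric DiSeg's evaluation uses, so the metric, the function name `boundary_scores`, and the token offsets are all assumptions made for this example rather than a description of the authors' procedure.

```python
def boundary_scores(predicted, gold):
    """Return (precision, recall, F1) over sets of boundary positions.

    Boundaries are represented as token offsets; this representation is
    an assumption for illustration, not taken from the paper.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # boundaries found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical boundary offsets for one text.
gold = [4, 9, 15, 22]
predicted = [4, 9, 17, 22, 30]

p, r, f = boundary_scores(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Scoring exact boundary matches is a strict choice: a boundary placed one token away from the gold position counts as both a false positive and a false negative.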
