Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool

Discourse structure has recently received considerable attention because of the benefits it offers in several NLP tasks, such as opinion mining, summarization, question answering, and text simplification. When automatically analyzing texts, discourse parsers typically perform two tasks: i) identifying basic discourse units (text segmentation), and ii) linking discourse units by means of discourse relations, building structures such as trees or graphs. The resulting discourse structures are generally accurate for intra-sentence relations, but they fail to capture the correct inter-sentence relations. Detecting the main discourse unit (the Central Unit) helps discourse analyzers, as well as manual annotators, improve their results in rhetorical labeling. Bearing this in mind, we set out to build the first two steps of a discourse parser following a top-down strategy: i) finding discourse units, and ii) detecting the Central Unit. The final step, assigning rhetorical relations, remains to be addressed in the near future. In accordance with this strategy, this paper presents a tool consisting of a discourse segmenter and an automatic Central Unit detector.
