Building a Language‐Independent Discourse Parser using Universal Networking Language

Discourse parsing has become an essential task in natural language processing. Parsing complex discourse structures beyond the sentence level remains a significant challenge. This article proposes a discourse parser that constructs rhetorical structure (RS) trees to identify such structures. Unlike previous parsers, which construct RS trees using lexical features, syntactic features, and cue phrases, the proposed parser uses high‐level semantic features derived from the Universal Networking Language (UNL). Because the UNL represents text in a language‐independent manner, it also makes the parser language‐independent. The parser uses a naive Bayes probabilistic classifier to label discourse relations. It was tested on 500 Tamil‐language documents and on the Rhetorical Structure Theory Discourse Treebank, which comprises 21 English‐language documents. The naive Bayes classifier was compared with the support vector machine (SVM) classifier used in earlier discourse parsers. The naive Bayes classifier proved better suited to discourse relation labeling than the SVM classifier in terms of training time, testing time, and accuracy.
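As a rough illustration of how a naive Bayes classifier can label discourse relations from categorical semantic features, the sketch below trains on a toy data set. It is not the article's implementation: the feature names (`unl_relation`, `nucleus_pos`), the toy samples, the relation labels, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesRelationLabeler:
    """Categorical naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, samples, labels):
        self.label_counts = Counter(labels)
        self.total = len(labels)
        # feature_counts[label][feature_name][value] -> occurrence count
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        # values[feature_name] -> set of values seen in training
        self.values = defaultdict(set)
        for features, label in zip(samples, labels):
            for name, value in features.items():
                self.feature_counts[label][name][value] += 1
                self.values[name].add(value)
        return self

    def predict(self, features):
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / self.total)
            for name, value in features.items():
                seen = self.feature_counts[label][name][value]
                vocab = len(self.values[name]) + 1  # extra slot for unseen values
                score += math.log((seen + 1) / (count + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: each sample describes a pair of text spans.
samples = [
    {"unl_relation": "rsn", "nucleus_pos": "verb"},
    {"unl_relation": "rsn", "nucleus_pos": "noun"},
    {"unl_relation": "con", "nucleus_pos": "verb"},
    {"unl_relation": "con", "nucleus_pos": "verb"},
    {"unl_relation": "pur", "nucleus_pos": "noun"},
]
labels = ["Cause", "Cause", "Condition", "Condition", "Purpose"]

labeler = NaiveBayesRelationLabeler().fit(samples, labels)
print(labeler.predict({"unl_relation": "con", "nucleus_pos": "verb"}))
```

Training a model of this kind only requires counting feature values per label, which is one plausible reason such a classifier can train quickly; the sketch makes no claim about the timings or accuracy reported in the article.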
