Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing

We present a novel approach for rapidly developing a corpus with discourse annotations using crowdsourcing. Although discourse annotation typically requires considerable time and cost owing to its complexity, we obtain discourse annotations in an extremely short time while retaining good annotation quality by crowdsourcing two annotation subtasks. In our experiment, creating a corpus of 30,000 Japanese sentences took less than eight hours. Based on this corpus, we also develop a supervised discourse parser and evaluate its performance to verify the usefulness of the acquired corpus.
