Improving Crowdsourcing-Based Annotation of Japanese Discourse Relations

Although discourse parsing is an important and fundamental task in natural language processing, few languages have corpora annotated with discourse relations, and those that exist are small. Creating a new corpus of discourse relations by hand is costly and time-consuming. To cope with this problem, Kawahara et al. (2014) constructed a Japanese corpus with discourse annotations through crowdsourcing. However, they did not evaluate the quality of the annotation. In this paper, we evaluate the quality of the annotation against expert annotations. We find that crowdsourcing-based annotation still leaves much room for improvement. Based on an error analysis, we propose improvement techniques based on language tests. We re-annotated the corpus using these techniques and achieved an improvement of approximately 3% in F-measure. We will make the re-annotated data publicly available.

[1] Daisuke Kawahara, et al. Building a Diverse Document Leads Corpus Annotated with Semantic Relations, 2012, PACLIC.

[2] Deniz Zeyrek, et al. Turkish Discourse Bank: Porting a discourse annotation style to a morphologically rich language, 2013, Dialogue Discourse.

[3] Alex Lascarides, et al. Logics of Conversation, 2005, Studies in natural language processing.

[4] Javier R. Movellan, et al. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise, 2009, NIPS.

[5] Brendan T. O'Connor, et al. Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks, 2008, EMNLP.

[6] Yuping Zhou, et al. PDTB-style Discourse Annotation of Chinese Text, 2012, ACL.

[7] Livio Robaldo, et al. The Penn Discourse TreeBank 2.0, 2008, LREC.

[8] Rashmi Prasad, et al. Reflections on the Penn Discourse TreeBank, Comparable Corpora, and Complementary Annotation, 2014, CL.

[9] Nicolas Lefebvre, et al. Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax, 2016, COLING.

[10] Daisuke Bekki, et al. Building a Japanese Corpus of Temporal-Causal-Discourse Structures Based on SDRT for Extracting Causal Relations, 2014, EACL.

[11] Daisuke Kawahara, et al. Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing, 2014, COLING.

[12] Makoto Nagao, et al. A Syntactic Analysis Method of Long Japanese Sentences Based on the Detection of Conjunctive Structures, 1994, CL.

[13] Daniel Marcu, et al. Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory, 2001, SIGDIAL Workshop.