Annotation of subtitle paraphrases using a new web tool

This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators had to decide to what extent two sentences meant approximately the same thing. The sentences originate from subtitles of movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with that of a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication, as well as for open crowdsourced projects, in which anyone can participate and users are identified by their IP addresses.
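As an illustration of what measuring inter-annotator agreement between two annotators can look like, the following minimal sketch computes Cohen's kappa, a widely used chance-corrected agreement statistic. The abstract does not state which agreement measure was used, so the choice of kappa, the function name `cohen_kappa`, and the label values are assumptions made for this example only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    labels_a and labels_b are equal-length sequences of category labels,
    one label per annotated sentence pair (hypothetical data layout).
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout;
        # kappa is undefined here, so report perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for four sentence pairs.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa equals 1 for perfect agreement and 0 when agreement is no better than chance, which makes it easier to compare across annotation schemes than raw percentage agreement.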
