The GUM corpus: creating multilayer resources in the classroom

This paper presents the methodology, design principles and detailed evaluation of a new, freely available multilayer corpus, collected and edited via classroom annotation using collaborative software. After a brief discussion of corpus design for open, extensible corpora, five classroom annotation projects are presented, covering structural markup in TEI XML, multiple part-of-speech tagging, constituent and dependency parsing, information-structural and coreference annotation, and Rhetorical Structure Theory analysis. Each layer is inspected for annotation quality, and together the layers form a richly annotated corpus that can be used to study interactions between different levels of linguistic description. The evaluation indicates the quality that can be expected of a corpus created by students with relatively little training. A multifactorial example study on lexical NP coreference likelihood illustrates some applications of the corpus. The results of this project show that high-quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for the use of corpora in linguistics pedagogy.
