Paper information - Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies

This paper discusses the main linguistic phenomena of user-generated texts found on the web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources, the aim of this paper is twofold: (1) to provide a short but comprehensive overview of such treebanks, based on available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines that promote consistent treatment of the particular phenomena found in these types of texts. The main goal is to provide a common framework for teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, a principle that has always been in the spirit of UD.
