Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies

This paper discusses the main linguistic phenomena of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given the increasing number of treebanks featuring user-generated content on the one hand, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short but comprehensive overview of such treebanks, based on the available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines that promote consistent treatment of the particular phenomena found in these types of texts. The overall goal is to provide a common framework for teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, a principle that has always been in the spirit of UD.
