From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario

User-generated content poses many challenges for automatic processing. While many of these stem from out-of-vocabulary effects, others arise from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, and of game chat logs and associated forums from two popular online games (MINECRAFT & LEAGUE OF LEGENDS). We chose these domains because they exhibit different degrees of lexical and syntactic compliance with canonical language. We conduct automatic and manual evaluations of the difficulty of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotating these data. We also discuss the development cost of our data set.
