Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Nonstandardized Texts

In this work, we consider Part-of-Speech tagging of social media text as a fundamental task for Natural Language Processing. We present improvements to a Markov model tagger for social media text by adapting parameter estimation methods for unknown tokens. In addition, we propose enriching the social media training corpus through a linear combination with a newspaper training corpus. Applied to a social media text corpus, our tagger achieves accuracies of around 94.8%, which comes close to the accuracies achieved on standardized texts.
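As a rough illustration of what a linear combination of two training corpora can look like, the following sketch interpolates relative-frequency emission estimates from a social media corpus and a newspaper corpus. It is a minimal sketch under stated assumptions, not the estimator used in this work: the function names, the weight `lam`, the toy sentences, and the tags are all hypothetical, and combining at the level of emission probabilities is only one possible reading of the abstract.

```python
from collections import Counter, defaultdict


def emission_counts(tagged_sentences):
    """Count how often each token occurs with each tag.

    tagged_sentences: iterable of sentences, each a list of (token, tag) pairs.
    Returns a mapping tag -> Counter of tokens.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[tag][token] += 1
    return counts


def relative_frequency(token, tag, counts):
    """Relative-frequency estimate of P(token | tag); 0.0 if the tag is unseen."""
    tag_counts = counts.get(tag, Counter())
    total = sum(tag_counts.values())
    return tag_counts[token] / total if total else 0.0


def interpolated_emission(token, tag, social_counts, news_counts, lam=0.8):
    """Linear combination of the two corpus estimates.

    lam is a hypothetical mixing weight in [0, 1]; lam = 1.0 uses only the
    social media corpus, lam = 0.0 only the newspaper corpus.
    """
    p_social = relative_frequency(token, tag, social_counts)
    p_news = relative_frequency(token, tag, news_counts)
    return lam * p_social + (1.0 - lam) * p_news


# Toy data (illustrative only, not taken from the corpora used in the paper).
social = emission_counts([[("lol", "ITJ"), ("gut", "ADJD")]])
news = emission_counts([[("gut", "ADJD"), ("schnell", "ADJD")]])

p_gut = interpolated_emission("gut", "ADJD", social, news)
p_schnell = interpolated_emission("schnell", "ADJD", social, news)
```

In this toy setting, a token that appears only in the newspaper data ("schnell") still receives a nonzero emission probability, which is the intuition behind enriching a small social media corpus with newspaper text. In practice, the weight would have to be tuned on held-out social media data.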
