LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text

We present a detailed description of our submission to the EmpiriST 2015 shared task on tokenization and part-of-speech (PoS) tagging of German social media text. Because relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging, adding unsupervised knowledge beyond the available training data is the most important factor for reaching acceptable tagging accuracy. A learning curve experiment furthermore shows that more in-domain training data is very likely to increase accuracy further.
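
To make the tokenization strategy concrete, the following is a minimal sketch of a tokenizer that combines a general-purpose regular expression with a word list of exceptions. It is not the submitted system: the exception entries, the pattern, and the names `EXCEPTIONS`, `GENERAL`, and `tokenize` are all assumptions chosen for illustration.

```python
import re

# Hypothetical exception list: strings that must never be split.
# The entries here (German abbreviations, emoticons) are illustrative only.
EXCEPTIONS = ["z.B.", "usw.", "d.h.", ":-)", ":)", ";-)"]

# Assumed general-case pattern, tried left to right:
# URLs, @mentions/#hashtags, numbers, (hyphenated) words,
# and finally any other single non-space character.
GENERAL = r"https?://\S+|[@#]\w+|\d+(?:[.,]\d+)*|\w+(?:-\w+)*|\S"

# Exceptions go first in the alternation, longest first,
# so that ":-)" is preferred over a bare ":".
EXCEPTION_PATTERN = "|".join(
    re.escape(e) for e in sorted(EXCEPTIONS, key=len, reverse=True)
)
TOKEN_RE = re.compile(f"{EXCEPTION_PATTERN}|{GENERAL}")


def tokenize(text):
    """Return the tokens of `text`; whitespace is skipped."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Das ist z.B. toll :-) http://example.org"))
    # ['Das', 'ist', 'z.B.', 'toll', ':-)', 'http://example.org']
```

Placing the exceptions ahead of the general pattern means a listed item is kept whole, while everything else falls through to the generic rules. This keeps the regular expression small and moves domain-specific cases into data that can be extended without changing code.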
