SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts

Off-the-shelf part-of-speech taggers typically perform relatively poorly on web and social media texts because those domains differ considerably from the newspaper articles on which most tagger models are trained. In this paper, we describe SoMeWeTa, a part-of-speech tagger based on the averaged structured perceptron that is capable of domain adaptation and can use various external resources. We train the tagger on the German web and social media data of the EmpiriST 2015 shared task. Using the TIGER corpus as background data and adding external information about word classes and Brown clusters, we substantially improve on the state of the art for both the web and the social media data sets. The tagger is available as free software.
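
To make the underlying learning scheme concrete, the following is a minimal sketch of an averaged perceptron tagger. It is not SoMeWeTa's implementation: it decodes greedily from left to right rather than searching over whole tag sequences as a structured perceptron would, and every name in it (`AveragedPerceptron`, `features`, `train`, `tag`, the feature templates, and the toy cluster identifiers) is a hypothetical choice made for illustration. The `clusters` dictionary stands in for an external resource such as Brown clusters.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron with lazily averaged weights (illustrative only)."""

    def __init__(self, tags):
        self.tags = sorted(tags)             # fixed order makes tie-breaking deterministic
        self.weights = defaultdict(float)    # (feature, tag) -> current weight
        self._totals = defaultdict(float)    # (feature, tag) -> weight summed over steps
        self._stamps = defaultdict(int)      # (feature, tag) -> step of last change
        self._step = 0

    def score(self, feats, tag):
        return sum(self.weights.get((f, tag), 0.0) for f in feats)

    def predict(self, feats):
        return max(self.tags, key=lambda t: self.score(feats, t))

    def _bump(self, key, delta):
        # Credit the old weight for the steps it was in force, then change it.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._step
        self.weights[key] += delta

    def update(self, feats, gold):
        self._step += 1
        guess = self.predict(feats)
        if guess != gold:
            for f in feats:
                self._bump((f, gold), 1.0)
                self._bump((f, guess), -1.0)

    def average(self):
        # Replace each weight by its average over all training steps.
        if self._step == 0:
            return
        for key, w in self.weights.items():
            total = self._totals[key] + (self._step - self._stamps[key]) * w
            self.weights[key] = total / self._step


def features(tokens, i, prev_tag, clusters):
    """Hypothetical feature templates; `clusters` maps lowercased words to cluster IDs."""
    word = tokens[i]
    return [
        "bias",
        "word=" + word.lower(),
        "suffix3=" + word[-3:].lower(),
        "is_upper=" + str(word[:1].isupper()),
        "prev_tag=" + prev_tag,
        "cluster=" + clusters.get(word.lower(), "NONE"),
    ]


def train(model, sentences, clusters, epochs=5):
    for _ in range(epochs):
        for tokens, tags in sentences:
            prev = "<s>"
            for i, gold in enumerate(tags):
                model.update(features(tokens, i, prev, clusters), gold)
                prev = gold                  # condition on the gold history while training
    model.average()


def tag(model, tokens, clusters):
    prev, out = "<s>", []
    for i in range(len(tokens)):
        prev = model.predict(features(tokens, i, prev, clusters))
        out.append(prev)
    return out


if __name__ == "__main__":
    # Toy data with STTS-style tags; the cluster IDs are made up.
    toy_clusters = {"ich": "0110", "lache": "1011"}
    toy_data = [(["ich", "lache"], ["PPER", "VVFIN"])]
    model = AveragedPerceptron({"PPER", "VVFIN"})
    train(model, toy_data, toy_clusters)
    print(tag(model, ["ich", "lache"], toy_clusters))
```

Averaging the weights over all training steps, instead of keeping only the final weight vector, is what distinguishes the averaged perceptron from the plain perceptron; the timestamp bookkeeping in `_bump` avoids touching every weight at every step.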
