Paper information - SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts

Off-the-shelf part-of-speech taggers typically perform relatively poorly on web and social media texts since those domains are quite different from the newspaper articles on which most tagger models are trained. In this paper, we describe SoMeWeTa, a part-of-speech tagger based on the averaged structured perceptron that is capable of domain adaptation and that can use various external resources. We train the tagger on the German web and social media data of the EmpiriST 2015 shared task. Using the TIGER corpus as background data and adding external information about word classes and Brown clusters, we substantially improve on the state of the art for both the web and the social media data sets. The tagger is available as free software.
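To make the core technique more concrete, the following is a minimal sketch of an averaged perceptron tagger. It is not SoMeWeTa's implementation: it swaps the structured search for greedy left-to-right decoding, uses a tiny hypothetical feature set, and omits domain adaptation and external resources such as Brown clusters. The class name, the `features`, `train` and `tag` helpers, and the toy sentences are all assumptions made for illustration.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass averaged perceptron over sparse string features (illustrative)."""

    def __init__(self, tags):
        self.tags = sorted(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> current weight
        self._totals = defaultdict(float)  # (feature, tag) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, tag) -> step of last update
        self._step = 0

    def predict(self, feats):
        # Score every tag as the sum of its feature weights; ties go to the
        # alphabetically first tag because self.tags is sorted.
        scores = {t: sum(self.weights.get((f, t), 0.0) for f in feats)
                  for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])

    def update(self, feats, gold, guess):
        self._step += 1
        if gold == guess:
            return
        for f in feats:
            for tag, delta in ((gold, 1.0), (guess, -1.0)):
                key = (f, tag)
                # Lazily account for the steps during which this weight was unchanged.
                self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
                self._stamps[key] = self._step
                self.weights[key] += delta

    def average(self):
        # Replace each weight by its mean over all training steps.
        if self._step == 0:
            return
        for key, w in list(self.weights.items()):
            total = self._totals[key] + (self._step - self._stamps[key]) * w
            self.weights[key] = total / self._step


def features(words, i, prev_tag):
    # Hypothetical feature set; a real tagger would use far richer context.
    w = words[i]
    return ["bias", "word=" + w.lower(), "suffix=" + w[-3:].lower(),
            "prev_tag=" + prev_tag, "is_capitalized=" + str(w[:1].isupper())]


def train(model, sentences, epochs=5):
    for _ in range(epochs):
        for words, gold_tags in sentences:
            prev = "<START>"
            for i, gold in enumerate(gold_tags):
                feats = features(words, i, prev)
                model.update(feats, gold, model.predict(feats))
                prev = gold  # assumption: gold tag history during training
    model.average()


def tag(model, words):
    prev, out = "<START>", []
    for i in range(len(words)):
        prev = model.predict(features(words, i, prev))  # greedy decoding
        out.append(prev)
    return out


# Toy data (invented for this sketch, not taken from EmpiriST or TIGER).
toy_sentences = [
    (["der", "Hund", "bellt"], ["ART", "NN", "VVFIN"]),
    (["die", "Katze", "schläft"], ["ART", "NN", "VVFIN"]),
]
model = AveragedPerceptron({"ART", "NN", "VVFIN"})
train(model, toy_sentences)
print(tag(model, ["der", "Hund", "bellt"]))
```

Averaging the weights over all training steps, rather than keeping only the final weights, is what distinguishes the averaged perceptron from the plain one; it tends to reduce sensitivity to the last few training examples.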

Thomas Proisl
