POS taggers typically fail to tag grammatical neologisms correctly: for a known word, a tagger considers only the tags already attested for that word, and so rules out the possibility that the word is used in a novel or deviant grammatical category in the text at hand. Because grammatical neologisms are relatively rare, this does not significantly affect a tagger's overall performance, but it makes traditional taggers poorly suited to studies of neologisms and grammaticalization processes. This article describes a modified POS tagger, called NeoTag, that explicitly considers new tags for known words, making it better suited to neologism research. NeoTag's overall accuracy is comparable to that of other taggers, but it scores much better on grammatical neologisms. To achieve this, the tagger applies a system of lexical smoothing, which adds new categories to known words based on known homographs. NeoTag also lemmatizes words as part of the tagging process, achieving high lemmatization accuracy for both known and unknown words without the need for an external lexicon. The use of NeoTag is not restricted to grammatical neologism detection; it can be used for other purposes as well.
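To make the idea of lexical smoothing more concrete, the following is a minimal sketch of one way such a scheme could work. It is not NeoTag's actual algorithm, which the text above does not specify. The sketch assumes a toy lexicon mapping each word to tag counts, estimates how often words carrying one tag also carry another (the "known homographs"), and reserves a small share of probability mass for tags a known word has never been seen with. The function names, the `weight` parameter, and the example lexicon are all hypothetical.

```python
from collections import defaultdict


def cooccurrence_rates(lexicon):
    """For each ordered tag pair (a, b), estimate the share of words
    carrying tag a that also carry tag b (i.e. known homographs)."""
    has_tag = defaultdict(int)
    has_both = defaultdict(int)
    for tags in lexicon.values():
        for a in tags:
            has_tag[a] += 1
            for b in tags:
                if a != b:
                    has_both[(a, b)] += 1
    return {pair: n / has_tag[pair[0]] for pair, n in has_both.items()}


def smooth_word(word, lexicon, rates, weight=0.1):
    """Return a tag distribution for a known word, reserving `weight`
    of the probability mass for tags the word was never seen with.
    `weight` is an illustrative assumption, not a tuned value."""
    known = lexicon[word]  # dict: tag -> observed count
    total = sum(known.values())

    # Score each unseen tag by how often it co-occurs with the known tags.
    extra = defaultdict(float)
    for a, count in known.items():
        for (src, b), rate in rates.items():
            if src == a and b not in known:
                extra[b] += (count / total) * rate

    extra_total = sum(extra.values())
    reserved = weight if extra_total > 0 else 0.0

    dist = {tag: (1 - reserved) * c / total for tag, c in known.items()}
    for b, score in extra.items():
        dist[b] = reserved * score / extra_total
    return dist


# Hypothetical toy lexicon: word -> {tag: count}
toy_lexicon = {
    "run": {"NOUN": 5, "VERB": 5},
    "walk": {"NOUN": 2, "VERB": 8},
    "google": {"NOUN": 10},
    "table": {"NOUN": 10},
}
toy_rates = cooccurrence_rates(toy_lexicon)
print(smooth_word("google", toy_lexicon, toy_rates))
```

In this sketch, "google" is attested only as a noun, but because half of the nouns in the toy lexicon are also verbs, it receives a small verb probability that a context model could then confirm or reject. A word already attested with every co-occurring tag, such as "run", keeps its original distribution.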