Part-of-Speech Tagging for Twitter: Word Clusters and Other Advances

We present improvements to a Twitter part-of-speech tagger, making use of several new features and large-scale word clustering. With these changes, tagging accuracy increased from 89.2% to 92.8%, and the tagger now runs 40 times faster. In addition, we expanded our Twitter tokenizer to support a broader range of Unicode characters, emoticons, and URLs. Finally, we annotate and evaluate on a new tweet dataset, DAILYTWEET547, that is more statistically representative of English-language Twitter as a whole. The new tagger is released as TweetNLP version 0.3, along with the new annotated data and large-scale word clusters, at http://www.ark.cs.cmu.edu/TweetNLP.

This research was supported in part by an REU supplement to NSF grant IIS-0915187 and Google’s support of the Worldly Knowledge project at CMU.