Really, Is Medical Sublanguage That Different? Experimental Counter-evidence from Tagging Medical and Newspaper Corpora

We compare the performance of two part-of-speech taggers, both trained on a German newspaper corpus, when applied to mixed types of medical documents. TnT, a tagger based on a statistical language model, outperforms Brill's rule-based tagger and, when supplied with additional lexicon resources, matches state-of-the-art performance figures (close to 97% accuracy) on the medical corpus. We explain this unexpected result by the statistically significant overlap of part-of-speech types between the newspaper training set and the medical test set. At least at that level, sublanguage differences seem to vanish. Thus, off-the-shelf statistical part-of-speech taggers can immediately be reused for medical language processing.
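The two quantities this abstract relies on, tagging accuracy and part-of-speech overlap between training and test data, can be illustrated with a minimal sketch. The abstract does not state how the overlap was measured, so counting shared tag n-gram types is an assumption made here for illustration, as are the function names and the toy tag sequences (written with STTS-style tags); no significance test is included.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Token-level accuracy: share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def tag_ngrams(tags, n):
    """Count all contiguous tag n-grams in a sequence of POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def type_overlap(train_tags, test_tags, n=3):
    """Share of distinct tag n-gram types in the test data that also occur in training.

    Assumption: this is one plausible reading of "part-of-speech type overlap",
    not necessarily the measure used in the paper.
    """
    train_types = set(tag_ngrams(train_tags, n))
    test_types = set(tag_ngrams(test_tags, n))
    if not test_types:
        return 0.0
    return len(test_types & train_types) / len(test_types)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from either corpus.
    newspaper = ["ART", "NN", "VVFIN", "ART", "ADJA", "NN", "APPR", "ART", "NN"]
    medical = ["ART", "ADJA", "NN", "VVFIN", "ART", "NN"]
    predicted = ["ART", "ADJA", "NN", "VVFIN", "ART", "NE"]

    print("unigram overlap:", type_overlap(newspaper, medical, n=1))
    print("trigram overlap:", type_overlap(newspaper, medical, n=3))
    print("accuracy:", accuracy(medical, predicted))
```

On real corpora, a high overlap at the tag level would be consistent with the abstract's explanation: a statistical tagger's transition model sees largely the same tag contexts in both domains, so most remaining errors should come from unknown words, which is where additional lexicon resources help.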