Automatic Morphosyntactic Analaysis of Clinical Text

Electronical health records, also called clinical texts, have their own linguistic characteristics and have been shown to deviate from standard language. Therefore, computational linguistics tools trained on standard language presumably do not achieve the same accuracy when applied to clinical data. In this paper, we describe a pipeline of tools for the automatic processing of clinical texts in Swedish from tokenization through part-of-speech tagging and dependency parsing. The evaluation of the components of the pipeline shows that existing NLP tools can be used, but performance drops greatly when models trained on standard language are applied to clinical data. We also present a small, syntactically annotated data set of clinical text to serve as gold standard.