Professional language in Swedish clinical text: Linguistic characterization and comparative studies

This study investigates the linguistic characteristics of Swedish clinical text in radiology reports and doctors' daily notes from electronic health records (EHRs), in comparison to general Swedish and biomedical journal text. We quantify linguistic features through a comparative register analysis to determine how the free text of EHRs differs from general and biomedical Swedish text in terms of lexical complexity, word and sentence composition, and common sentence structures. The linguistic features are extracted using state-of-the-art computational tools: a tokenizer, a part-of-speech tagger, and scripts for statistical analysis. Results show that technical terms and abbreviations are more frequent in clinical text, and that lexical variance is low. Moreover, clinical text frequently omits subjects, verbs, and function words, resulting in shorter sentences. Clinical text differs not only from general Swedish but also internally, across its sub-domains; for example, sentences lacking verbs are significantly more frequent in radiology reports. These results provide a foundation for future development of automatic methods for EHR simplification or clarification.
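
The kind of measurement described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the text has already been tokenized and part-of-speech tagged, that each sentence is a list of (token, tag) pairs, and that verbs carry the tag "VB" (an assumption about the tagset). The function name `register_features` and the toy sentences are hypothetical.

```python
from typing import Dict, List, Tuple

# One sentence = list of (token, part-of-speech tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def register_features(
    sentences: List[TaggedSentence], verb_tag: str = "VB"
) -> Dict[str, float]:
    """Compute three simple register features for a tagged corpus.

    verb_tag is an assumed tag name for verbs; adjust it to the tagset in use.
    """
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    if not tokens:
        raise ValueError("corpus contains no tokens")

    # Count sentences in which no token carries the verb tag.
    verbless = sum(
        1 for sent in sentences if not any(pos == verb_tag for _, pos in sent)
    )

    return {
        # Distinct word forms divided by total tokens (lexical variance proxy).
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Average number of tokens per sentence.
        "mean_sentence_length": len(tokens) / len(sentences),
        # Fraction of sentences without any verb.
        "verbless_share": verbless / len(sentences),
    }


# Hypothetical toy data, not taken from the study's corpora.
clinical = [
    [("Ingen", "DT"), ("pneumothorax", "NN")],
    [("Hjärta", "NN"), ("normalstort", "JJ")],
]
general = [
    [("Patienten", "NN"), ("har", "VB"), ("ingen", "DT"), ("pneumothorax", "NN")],
]

clinical_features = register_features(clinical)
general_features = register_features(general)
```

Because the type-token ratio falls as a corpus grows, a comparison across registers would need samples of equal size; the sketch leaves that sampling step out.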
