Inferred joint multigram models for medical term normalization according to ICD

BACKGROUND Electronic Health Records (EHRs) are written in spontaneous natural language. Their terms often do not match standard terminology such as that of the International Classification of Diseases (ICD).

OBJECTIVE Standard terminology can improve information retrieval and exchange. Our aim is to render diagnostic terms (DTs) written in spontaneous language in EHRs into the standard framework provided by the ICD.

METHODS We tackle diagnostic term normalization with Weighted Finite-State Transducers (WFSTs). Given a set of samples, these machines learn to translate sequences, in our case from spontaneous representations into standard representations. They are highly flexible and easily adapted to the terminological singularities of each hospital and practitioner. In addition, we implemented a similarity metric to enhance the matching of spontaneous terms to standard terms.

RESULTS Of the 2850 randomly selected spontaneous DTs, only 7.71% were written in their standard form matching the ICD. The WFST-based system matched spontaneous DTs to ICD codes with a Mean Reciprocal Rank of 0.68, which means that, on average, the right ICD code is found between the first and second position among the normalized set of candidates. This supports efficient document exchange and, furthermore, information retrieval.

CONCLUSION Medical term normalization was achieved with high performance. Direct matching of spontaneous terms against standard lexicons leads to unsatisfactory results, while normalized hypothesis generation by means of WFSTs helped to bridge the gap between spontaneous and standard language.
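To make the reported metric concrete, the following is a minimal sketch of how a Mean Reciprocal Rank could be computed over ranked candidate lists. The function names, the example ICD codes, and the candidate rankings are all hypothetical illustrations, not the authors' implementation or data. A term whose correct code is absent from the candidate list is assumed here to contribute a reciprocal rank of 0.

```python
from typing import Sequence


def reciprocal_rank(candidates: Sequence[str], gold: str) -> float:
    """Return 1/position of the gold code in the ranked candidates, or 0.0 if absent."""
    for position, code in enumerate(candidates, start=1):
        if code == gold:
            return 1.0 / position
    return 0.0  # assumption: a missing gold code scores zero


def mean_reciprocal_rank(ranked_lists: Sequence[Sequence[str]],
                         gold_codes: Sequence[str]) -> float:
    """Average the reciprocal ranks over all evaluated terms."""
    if len(ranked_lists) != len(gold_codes):
        raise ValueError("one gold code is required per ranked list")
    if not ranked_lists:
        raise ValueError("at least one ranked list is required")
    total = sum(reciprocal_rank(cands, gold)
                for cands, gold in zip(ranked_lists, gold_codes))
    return total / len(ranked_lists)


# Hypothetical candidate rankings for three spontaneous terms.
ranked = [
    ["I10", "I11"],         # gold code in position 1 -> 1.0
    ["E11", "E10", "E14"],  # gold code in position 2 -> 0.5
    ["J45", "J44"],         # gold code not proposed  -> 0.0
]
gold = ["I10", "E10", "J18"]

mrr = mean_reciprocal_rank(ranked, gold)
print(f"MRR = {mrr:.2f}")
```

Because the reciprocal of an MRR is the harmonic mean of the ranks, an MRR of 0.68 corresponds to a rank of about 1/0.68 ≈ 1.47, which is consistent with the reading "between the first and second position".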
