Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates replacement candidates for a misspelling and ranks them by semantic fit, calculating a weighted cosine similarity between the vectorized representation of each candidate and the misspelling context. We greatly outperform two baseline off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of an optimized noisy channel model, showing that neural embeddings can be successfully exploited to include context-awareness in a spelling correction model. Our source code, including a script to extract the annotated test data, can be found at https://github.com/pieterfivez/bionlp2017.
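The ranking step described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the three-dimensional toy vectors, and the reciprocal-distance weighting of context words are all assumptions made for illustration, and candidate generation is replaced by a hand-written candidate list. The sketch sums the weighted vectors of the words around the misspelling and ranks each candidate by its cosine similarity to that sum.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def context_vector(tokens, index, embeddings, window=2):
    """Weighted sum of the vectors of the words around tokens[index].

    Assumption: each context word is weighted by 1 / distance to the
    misspelling. This weighting is chosen for illustration only.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for offset in range(-window, window + 1):
        pos = index + offset
        if offset == 0 or pos < 0 or pos >= len(tokens):
            continue
        vec = embeddings.get(tokens[pos])
        if vec is None:
            continue  # skip context words without a vector
        weight = 1.0 / abs(offset)
        total = [t + weight * x for t, x in zip(total, vec)]
    return total

def rank_candidates(tokens, index, candidates, embeddings, window=2):
    """Return (candidate, score) pairs sorted by similarity to the context, best first."""
    ctx = context_vector(tokens, index, embeddings, window)
    scored = [(c, cosine(embeddings[c], ctx)) for c in candidates if c in embeddings]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors; real embeddings would be trained on clinical text.
TOY_EMBEDDINGS = {
    "patient": [0.9, 0.1, 0.0],
    "has": [0.5, 0.5, 0.1],
    "severe": [0.8, 0.2, 0.1],
    "chest": [0.85, 0.15, 0.05],
    "pain": [0.9, 0.2, 0.0],
    "paint": [0.0, 0.1, 0.9],
}

tokens = ["patient", "has", "severe", "chest", "pian"]
ranking = rank_candidates(tokens, 4, ["paint", "pain"], TOY_EMBEDDINGS)
for candidate, score in ranking:
    print(candidate, round(score, 3))
```

With these toy vectors, "pain" ranks above "paint" for the misspelling "pian" because its vector points in nearly the same direction as the weighted context. In a setting like the one described in the abstract, the vectors would instead come from a model with character n-gram information, so that out-of-vocabulary candidates can still be vectorized.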
