Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

Background: The volume of unstructured medical data that can be analysed for different purposes is growing. However, extracting information from free text can be particularly inefficient when the text contains spelling errors. Existing approaches use string similarity methods, coupled with a supporting dictionary, to search for valid words within a text. However, these methods are not rich enough to capture both typing and phonetic misspellings.

Results: Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics at identifying misspelt names of drugs in a set of medical records written in Portuguese.

Conclusion: We present a hybrid approach that efficiently performs similarity matching and overcomes the loss of information inherent in using either exact-match search or string-based similarity search methods.
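The idea of combining string and phonetic similarity against a supporting dictionary can be illustrated with a minimal sketch. This is not the authors' implementation: the phonetic rules in `phonetic_key`, the weight `alpha`, the `threshold`, and all function names are hypothetical choices made for illustration only, and the rule set is a heavily simplified approximation of Portuguese pronunciation.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical, simplified grapheme-to-sound rules (assumption, not from the paper).
PHONETIC_RULES = (("y", "i"), ("w", "v"), ("ph", "f"), ("ch", "x"),
                  ("ce", "se"), ("ci", "si"), ("ss", "s"), ("z", "s"), ("h", ""))


def phonetic_key(word: str) -> str:
    """Map a word to a rough phonetic key: same-sounding spellings collide."""
    text = unicodedata.normalize("NFC", word.lower()).replace("ç", "s")
    # Strip remaining accents by removing combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    for old, new in PHONETIC_RULES:
        text = text.replace(old, new)
    # Collapse runs of repeated letters.
    collapsed = []
    for ch in text:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def combined_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Weighted mix of plain string similarity and phonetic-key similarity."""
    string_part = similarity(a.lower(), b.lower())
    phonetic_part = similarity(phonetic_key(a), phonetic_key(b))
    return alpha * string_part + (1.0 - alpha) * phonetic_part


def best_match(token: str, dictionary: list[str], threshold: float = 0.8):
    """Return (entry, score) for the closest dictionary entry, or None."""
    scored = [(entry, combined_similarity(token, entry)) for entry in dictionary]
    entry, score = max(scored, key=lambda pair: pair[1])
    return (entry, score) if score >= threshold else None


drugs = ["paracetamol", "propranolol", "dipirona"]
match = best_match("parassetamol", drugs)
```

In this sketch, a phonetically motivated misspelling such as "parassetamol" receives a higher combined score against "paracetamol" than plain edit-distance similarity gives it, because both spellings reduce to the same phonetic key. A linear scan over the dictionary is used only for brevity; it says nothing about the efficiency of the approach described in the abstract.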
