Paper information - Implementation and evaluation of a text extraction tool for adverse drug reaction information

Implementation and evaluation of a text extraction tool for adverse drug reaction information

Background: Initial review of potential safety issues related to the use of medicines involves reading and searching existing medical literature sources for known associations between drugs and adverse drug reactions (ADRs), so that they can be excluded from further analysis. The task is labor-intensive and time-consuming.

Objective: To develop a text extraction tool that automatically identifies ADR information in medical adverse effects texts, to evaluate the performance of the tool's underlying text extraction algorithm, and to identify which parts of the algorithm contributed to that performance.

Method: A text extraction tool was implemented on the .NET platform with functionality for preprocessing text (removal of stop words, Porter stemming and use of synonyms) and for matching medical terms using permutations of words and spelling variations (Soundex, Levenshtein distance and longest common subsequence distance). Its performance was evaluated on both manually extracted medical terms (semi-structured texts) from summary of product characteristics (SPC) texts and unstructured adverse effects texts from Martindale (a medical reference for information about drugs and medicines), using the WHO-ART and MedDRA medical term dictionaries.

Results: For the SPC data set, a verbatim match identified 72% of the SPC terms. Using removal of stop words, Porter stemming, synonyms and permutations, the text extraction tool correctly matched 87% of the SPC terms while producing one false positive match. The use of the full MedDRA hierarchy contributed the most to performance. The sophisticated text algorithms together contributed roughly equally to the performance. Phonetic codes (i.e. Soundex) are evidently inferior to string distance measures (i.e. Levenshtein distance and longest common subsequence distance) for fuzzy matching in our implementation. The string distance measures increased the number of matched SPC terms, but at the expense of generating false positive matches. Results from Martindale show that 90% of the identified medical terms were correct. The majority of false positive matches were caused by extracting medical terms that do not describe ADRs.

Conclusion: Sophisticated text extraction can considerably improve the identification of ADR information from adverse effects texts compared to a verbatim extraction.
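The three spelling-variation measures named in the Method can be illustrated with a minimal Python sketch. This is not the tool's .NET implementation: the function names, the distance threshold, and the three-entry term list are hypothetical, the Soundex variant shown is the common American one, and the longest common subsequence distance is assumed here to be the number of insertions and deletions needed to turn one string into the other.

```python
# Illustrative sketch only; names, threshold and term list are assumptions.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    """Assumed definition: edits when only insert/delete are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
          **dict.fromkeys("dt", "3"), "l": "4",
          **dict.fromkeys("mn", "5"), "r": "6"}


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit:
            if digit != last:       # skip adjacent repeats of a code
                code += digit
            last = digit
        elif c not in "hw":         # vowels separate repeated codes
            last = ""
    return (code + "000")[:4]


def best_match(term, dictionary, max_distance=2):
    """Closest dictionary entry by Levenshtein distance, or None."""
    if not dictionary:
        return None
    distance, entry = min((levenshtein(term.lower(), e.lower()), e)
                          for e in dictionary)
    return entry if distance <= max_distance else None


if __name__ == "__main__":
    terms = ["headache", "nausea", "rash"]  # hypothetical term list
    print(best_match("headake", terms))
    print(soundex("headake"), soundex("headache"))
```

Loosening `max_distance` matches more misspelled terms but also admits more unrelated ones, which mirrors the trade-off between additional matches and false positives reported in the Results.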

Gunnar Dahlberg
