Data-Driven Information Extraction from Chinese Electronic Medical Records

Objective: This study proposes a data-driven framework that takes unstructured free-text narratives in Chinese Electronic Medical Records (EMRs) as input and converts them into structured time-event-description triples, where the description is either an elaboration or an outcome of the medical event.

Materials and Methods: The framework uses a hybrid approach. It consists of cross-domain core medical lexica, an unsupervised, iterative algorithm that accrues more accurate terms into the lexica, rules that address Chinese writing conventions and temporal descriptors, and a Support Vector Machine (SVM) algorithm that uses Normalized Google Distance (NGD) to estimate the correlation between medical events and their descriptions.

Results: The effectiveness of the framework was demonstrated on a dataset of 24,817 de-identified Chinese EMRs. The cross-domain medical lexica recognized terms with an F1-score of 0.896. Of the recorded medical events, 98.5% were linked to temporal descriptors. The NGD SVM description-event matching achieved an F1-score of 0.874. The end-to-end time-event-description extraction achieved an F1-score of 0.846.

Discussion: In named entity recognition, the proposed framework outperforms state-of-the-art supervised learning algorithms (F1-score: 0.896 vs. 0.886). In event-description association, the NGD SVM is superior to an SVM that uses only local context and semantic features (F1-score: 0.874 vs. 0.838).

Conclusions: The framework is data-driven, weakly supervised, and robust against the variation and noise that tend to occur in a large corpus. It addresses Chinese medical writing conventions and variations in writing style through patterns for discovering new terms and rules for updating the lexica.
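For readers unfamiliar with the Normalized Google Distance named in the methods, the following is a minimal sketch of the standard NGD formula computed from occurrence counts. The abstract does not say where the counts come from or how the NGD value enters the SVM, so the function name `ngd`, its parameters, and the example counts are illustrative assumptions rather than the authors' implementation.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from raw counts (illustrative sketch).

    fx, fy -- number of documents containing term x, term y
    fxy    -- number of documents containing both terms
    n      -- total number of documents indexed (assumed > fx and > fy)
    """
    # Terms that never occur, or never co-occur, are treated as
    # infinitely distant.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(lx, ly)
    if denominator <= 0:
        raise ValueError("n must exceed the smaller of fx and fy")
    return (max(lx, ly) - lxy) / denominator


# Hypothetical counts: an event term and a description term that
# co-occur in 10 of the 100 documents each appears in.
distance = ngd(fx=100, fy=100, fxy=10, n=10_000)
```

A value near 0 means the two terms almost always appear together, while larger values indicate weaker association. In a setup like the one the abstract describes, such a value could serve as one feature among others when scoring a candidate event-description pair.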
