A CRF-based Method for Automatic Construction of Chinese Symptom Lexicon

Lexicons play a key role in Medical Language Processing (MLP). Constructing a semantic lexicon has become a prerequisite for MLP research in China, where few clinical terminology resources are available. In this study, an iterative machine learning algorithm based on Conditional Random Fields (CRF) is proposed to automatically build a symptom lexicon from a clinical corpus. The algorithm was evaluated comprehensively under both exact and inexact matching. It achieved an F-measure of 87.23%, with a precision of 99.95% and a recall of 72.23%. Furthermore, a lexicon containing 22,501 symptoms was constructed using this approach.
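
The abstract does not define how exact and inexact matching were scored, so the following is only a minimal sketch under stated assumptions: a predicted symptom counts as an exact match when its boundaries are identical to a gold-standard annotation, and as an inexact match when it overlaps one at all. The `Span` type, the `overlaps` and `evaluate` helpers, and the example offsets are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Assumption: a symptom mention is a (start, end) character span, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(gold: List[Span], predicted: List[Span], exact: bool = True):
    """Compute (precision, recall, F-measure) for predicted spans.

    exact=True  : a prediction is correct only if it equals a gold span.
    exact=False : a prediction is correct if it overlaps any gold span
                  (an assumed definition of inexact matching).
    """
    if exact:
        matched_pred = sum(1 for p in predicted if p in gold)
        matched_gold = sum(1 for g in gold if g in predicted)
    else:
        matched_pred = sum(
            1 for p in predicted if any(overlaps(p, g) for g in gold)
        )
        matched_gold = sum(
            1 for g in gold if any(overlaps(g, p) for p in predicted)
        )

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: three gold mentions, two predictions,
# one of which has a shifted left boundary.
gold_spans = [(0, 2), (5, 9), (12, 15)]
predicted_spans = [(0, 2), (6, 9)]

exact_scores = evaluate(gold_spans, predicted_spans, exact=True)
inexact_scores = evaluate(gold_spans, predicted_spans, exact=False)
```

In this toy case the prediction with the shifted boundary is rejected under exact matching but accepted under inexact matching, which is why the two criteria are usually reported side by side.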
