Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries.

OBJECTIVE This paper has three aims: (1) to annotate a gold standard corpus of Chinese discharge summaries; (2) to perform word segmentation and named entity recognition on this corpus; and (3) to build a joint model that performs both tasks. DESIGN Two independent systems, one for word segmentation and one for named entity recognition, were built on conditional random field models. Most natural language processing approaches use a single model to predict outputs, but many studies have shown that combining techniques can improve performance on many tasks. We therefore propose a joint model that uses dual decomposition to perform both tasks and to exploit the correlations between them. Three sets of features were designed to demonstrate the advantage of the proposed joint model over independent models, incremental models, and a joint model trained on combined labels. MEASUREMENTS Micro-averaged precision (P), recall (R), and F-measure (F) were used to evaluate the results. RESULTS The gold standard corpus was created from 336 Chinese discharge summaries containing 71 355 words. Compared with performing each task alone, the dual decomposition framework improved segmentation by 0.2% and recognition by 1%. CONCLUSIONS The joint model is efficient and effective in both segmentation and recognition compared with performing the two tasks individually. The model achieved encouraging results, demonstrating the feasibility of the two tasks.
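The idea of making two sequence models agree through dual decomposition can be illustrated with a small sketch. It is not the authors' implementation: the conditional random fields are replaced by hand-set scores, and the agreement constraint is simplified so that both models must emit the same binary tag per character (1 = the character starts a new unit, 0 = it continues one). The names `viterbi` and `dual_decompose`, the shared transition matrix, and the step-size schedule are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def viterbi(unary: Sequence[Sequence[float]],
            trans: Sequence[Sequence[float]]) -> List[int]:
    """Best binary tag sequence under unary[i][t] + trans[prev][t]."""
    score = [list(unary[0])]
    back = []
    for i in range(1, len(unary)):
        row, ptr = [], []
        for t in (0, 1):
            cands = [score[-1][p] + trans[p][t] for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            row.append(cands[best] + unary[i][t])
            ptr.append(best)
        score.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    tag = 0 if score[-1][0] >= score[-1][1] else 1
    tags = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        tags.append(tag)
    return tags[::-1]


def dual_decompose(seg_unary, ner_unary, trans,
                   iters: int = 50, step: float = 1.0) -> Tuple[List[int], bool]:
    """Subgradient dual decomposition for max f(y) + g(z) subject to y == z.

    Returns the segmentation model's tags and whether the two models agreed.
    """
    u = [0.0] * len(seg_unary)  # one Lagrange multiplier per character
    y: List[int] = []
    for k in range(iters):
        # Each subproblem is solved separately with its own adjusted scores.
        y = viterbi([[s0, s1 + u[i]] for i, (s0, s1) in enumerate(seg_unary)], trans)
        z = viterbi([[s0, s1 - u[i]] for i, (s0, s1) in enumerate(ner_unary)], trans)
        if y == z:
            return y, True  # agreement certifies an optimal joint solution
        rate = step / (k + 1)  # decaying step size (an assumed schedule)
        u = [ui - rate * (yi - zi) for ui, yi, zi in zip(u, y, z)]
    return y, False


# Toy scores (assumed): the models disagree only on the last character.
seg_scores = [(0.0, 2.0), (1.0, 0.0), (0.0, 0.5)]
ner_scores = [(0.0, 2.0), (1.0, 0.0), (1.5, 0.0)]
no_transitions = [[0.0, 0.0], [0.0, 0.0]]
tags, agreed = dual_decompose(seg_scores, ner_scores, no_transitions)
```

In this toy case the segmentation model alone would tag the last character as 1, while the second model prefers 0 more strongly. After one multiplier update the two subproblems agree on the sequence with the higher combined score.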

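Micro-averaging pools true positives, false positives, and false negatives over all documents before computing the ratios. A minimal sketch follows, assuming each document's annotations are given as a set of `(start, end, label)` spans and that a prediction counts only on an exact match. That representation, the matching rule, and the function name `micro_prf` are illustrative assumptions, not details taken from the paper.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label); an assumed representation


def micro_prf(gold_docs: Iterable[Set[Span]],
              pred_docs: Iterable[Set[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F-measure with exact span matching."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # spans predicted exactly right
        fp += len(pred - gold)   # predicted spans not in the gold standard
        fn += len(gold - pred)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Two toy documents; the labels are made up for the example.
gold = [{(0, 2, "DISEASE"), (3, 5, "DRUG")}, {(0, 1, "SYMPTOM")}]
pred = [{(0, 2, "DISEASE"), (3, 4, "DRUG")}, {(0, 1, "SYMPTOM")}]
p, r, f = micro_prf(gold, pred)
```

Here two of the three predicted spans match the gold standard exactly, so precision, recall, and F-measure all come out at 2/3.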