A deep learning model incorporating part of speech and self-matching attention for named entity recognition of Chinese electronic medical records

Background: Named entity recognition (NER), a key step in extracting health information, faces several challenges in Chinese electronic medical records (EMRs). First, the casual use of Chinese abbreviations and doctors' individual writing styles can produce multiple expressions of the same entity, and no common Chinese medical dictionary is available to support accurate entity extraction. Second, EMRs contain entities from many categories, and entity length varies greatly across categories, which increases the difficulty of Chinese NER. Entity boundary detection is therefore the key to accurate entity extraction from Chinese EMRs, and a model is needed that recognizes entities of multiple lengths without relying on any medical dictionary.

Methods: In this study, we incorporate part-of-speech (POS) information into a deep learning model to improve the accuracy of Chinese entity boundary detection. To avoid incorrect POS tagging of long entities, we propose a method called reduced POS tagging, which retains the tags of general words but not those of seemingly medical entities. The proposed model, SM-LSTM-CRF, consists of three layers: a self-matching attention layer, which calculates the relevance of each character to the entire sentence; an LSTM (Long Short-Term Memory) layer, which captures the context features of each character; and a CRF (Conditional Random Field) layer, which labels characters based on their features and transition rules.

Results: Experiments on a Chinese EMR dataset show that SM-LSTM-CRF improves the F1 value by 2.59% over LSTM-CRF. Adding the POS feature to the model improves F1 by about 7.74%. Reduced POS tagging reduces false tagging of long entities, thereby increasing the F1 value by 2.42% and achieving an F1 score of 80.07%.

Conclusions: The POS feature produced by reduced POS tagging, together with the self-matching attention mechanism, tightly constrains entity boundaries and performs well in the recognition of clinical entities.
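
The self-matching attention layer described in the Methods can be illustrated with a minimal, dependency-free sketch. The abstract does not give the scoring function, so the scaled dot product used here is an assumption, as are the function names `softmax` and `self_matching_attention`, the toy embeddings, and the choice to concatenate each character vector with its sentence summary. It is not the authors' implementation; it only shows how each character can be scored against every character in the sentence.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_matching_attention(chars):
    """Score every character against the whole sentence.

    chars: list of equal-length character vectors for one sentence.
    Returns (weights, features):
      weights[i][j] is the relevance of character j to character i;
      features[i] is character i's vector concatenated with its
      attention-weighted summary of the sentence.
    """
    dim = len(chars[0])
    scale = math.sqrt(dim)
    weights, features = [], []
    for query in chars:
        # Assumed scoring function: scaled dot product (not from the paper).
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in chars]
        row = softmax(scores)
        # Sentence summary as seen from this character.
        context = [sum(w * vec[d] for w, vec in zip(row, chars))
                   for d in range(dim)]
        weights.append(row)
        features.append(list(query) + context)
    return weights, features


# Toy 3-character sentence with made-up 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, features = self_matching_attention(sentence)
```

In a model like the one described, the enriched per-character features would then be passed to the LSTM layer, and the CRF layer would assign the final labels.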
