An Empirical Investigation on Fine-Grained Syndrome Segmentation in TCM by Learning a CRF from a Noisy Labeled Data

Syndrome is an important component of Traditional Chinese Medicine (TCM) and a concept that distinguishes TCM from Western Medicine (WM). A clear understanding of TCM syndromes helps researchers digest TCM regularities and bridge TCM and WM. Syndromes are usually used in coarse-grained forms, so the fine-grained medical information buried within them is not considered. In this paper, we empirically investigate Fine-Grained Syndrome Segmentation (FGSS) in TCM, using a distantly supervised method to build a noisily labeled dataset on which CRFs for FGSS are trained. A series of experiments demonstrates the feasibility and effectiveness of the method; the best F1-score reaches 0.9177. To the best of our knowledge, our work is the first to focus on fine-grained information extraction in Chinese medical texts.
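As an illustration of how distant supervision can produce noisy character-level labels for a segmentation CRF, the sketch below tags a coarse-grained syndrome name by forward maximum matching against a small lexicon. The lexicon entries, the B/I/E/S/O tag scheme, and the function name `distant_label` are assumptions made for this example; the abstract does not specify the labeling procedure actually used.

```python
# Hypothetical lexicon of fine-grained syndrome elements (illustrative only,
# not taken from the paper).
LEXICON = {"肝", "肾", "脾", "阴虚", "气虚"}


def distant_label(syndrome, lexicon):
    """Tag each character by forward maximum matching against the lexicon.

    Tags: B/I/E mark the begin/inside/end of a multi-character match,
    S marks a single-character match, O marks an unmatched character.
    The labels are noisy because the lexicon is incomplete and the
    greedy matching can pick the wrong boundary.
    """
    max_len = max(len(entry) for entry in lexicon)
    tagged = []
    i = 0
    while i < len(syndrome):
        match_len = 0
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(syndrome) - i), 0, -1):
            if syndrome[i:i + n] in lexicon:
                match_len = n
                break
        if match_len == 0:
            tags = ["O"]
        elif match_len == 1:
            tags = ["S"]
        else:
            tags = ["B"] + ["I"] * (match_len - 2) + ["E"]
        span = syndrome[i:i + len(tags)]
        tagged.extend(zip(span, tags))
        i += len(tags)
    return tagged


if __name__ == "__main__":
    for ch, tag in distant_label("肝肾阴虚", LEXICON):
        print(ch, tag)
```

Sequences labeled this way could then serve as training data for a CRF toolkit; characters missing from the lexicon receive the O tag, which is one source of the label noise.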
