An automatic approach for constructing a knowledge base of symptoms in Chinese

Background: While a large number of well-known knowledge bases (KBs) in life science have been published as Linked Open Data, there are few KBs in Chinese. KBs in Chinese are nevertheless necessary for automatically processing and analyzing electronic medical records (EMRs) in Chinese. Among these, a symptom KB in Chinese is the most urgently needed, since symptoms are the starting point of clinical diagnosis.

Results: We publish a public KB of symptoms in Chinese, including symptoms, departments, diseases, medicines, and examinations, as well as relations between symptoms and these related entities. To the best of our knowledge, no other KB focuses on symptoms in Chinese, and this KB is an important supplement to existing medical resources. Our KB is constructed by fusing data automatically extracted from eight mainstream healthcare websites and three Chinese encyclopedia sites, supplemented by symptoms extracted from a large number of EMRs.

Methods: Firstly, we manually design the data schema with reference to the Unified Medical Language System (UMLS). Secondly, we extract entities from eight mainstream healthcare websites and use them as seeds to train a multi-class classifier, which classifies entities from encyclopedia sites, and a Conditional Random Field (CRF) model, which extracts symptoms from EMRs. Thirdly, we fuse the data to resolve large-scale duplication across data sources through entity type alignment, entity mapping, and attribute mapping. Finally, we link our KB to UMLS to investigate similarities and differences between symptoms in Chinese and English.

Conclusions: The KB contains more than 26,000 distinct symptoms in Chinese, including 3968 symptoms in traditional Chinese medicine, and 1029 synonym pairs for symptoms. The KB also includes concepts such as diseases and medicines, as well as relations between symptoms and these related entities. We also link our KB to UMLS and analyze the differences between symptoms in the two KBs. The KB is released as Linked Open Data, together with a demo, at https://datahub.io/dataset/symptoms-in-chinese.
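
The fusion step named in the Methods (entity type alignment, entity mapping, and attribute mapping) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the `TYPE_ALIGNMENT` and `ATTRIBUTE_MAPPING` tables, the helper names, and the exact-match rule on normalized names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical table: source-specific type labels -> shared schema types.
TYPE_ALIGNMENT = {
    "symptom": "Symptom",
    "sign": "Symptom",
    "disease": "Disease",
    "illness": "Disease",
}

# Hypothetical table: source-specific attribute names -> schema attributes.
ATTRIBUTE_MAPPING = {
    "alias": "synonym",
    "aka": "synonym",
    "dept": "department",
}


def normalize_name(name):
    """Remove whitespace and lowercase so trivially different names match."""
    return "".join(name.split()).lower()


def fuse(records):
    """Merge records that share an aligned type and a normalized name."""
    fused = {}
    for rec in records:
        # Entity type alignment: map the source label onto the schema type.
        etype = TYPE_ALIGNMENT.get(rec["type"], rec["type"])
        # Entity mapping (assumed rule): exact match on the normalized name.
        key = (etype, normalize_name(rec["name"]))
        entity = fused.setdefault(key, {
            "type": etype,
            "name": rec["name"],
            "attributes": defaultdict(set),
            "sources": set(),
        })
        entity["sources"].add(rec["source"])
        # Attribute mapping: rename attributes, then pool their values.
        for attr, value in rec.get("attributes", {}).items():
            entity["attributes"][ATTRIBUTE_MAPPING.get(attr, attr)].add(value)
    return list(fused.values())


# Made-up records from two hypothetical sources.
records = [
    {"source": "site_a", "type": "symptom", "name": "头痛",
     "attributes": {"alias": "头疼", "dept": "神经内科"}},
    {"source": "site_b", "type": "sign", "name": "头 痛",
     "attributes": {"aka": "头部疼痛"}},
    {"source": "site_b", "type": "illness", "name": "感冒",
     "attributes": {}},
]
fused = fuse(records)
```

In this sketch the two records for 头痛 collapse into one `Symptom` entity whose `synonym` attribute pools the values from both sources, while 感冒 remains a separate `Disease` entity. A real pipeline would likely need a softer matching rule than exact name equality.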
