An automatic approach for constructing a knowledge base of symptoms in Chinese

While many well-known knowledge bases (KBs) in life science have been published as Linked Open Data, few are available in Chinese. Such KBs are necessary for automatically processing and analyzing electronic medical records (EMRs) written in Chinese. Of these, a KB of symptoms in Chinese is the most urgently needed, since symptoms are the starting point of clinical diagnosis. At the same time, the expressions used to describe symptoms in clinical practice are diverse, which makes such a KB hard to build. In this paper, we publish a public KB of symptoms in Chinese. The KB is constructed by fusing data automatically extracted from eight mainstream healthcare websites and three Chinese encyclopedia sites, supplemented with symptoms extracted from a large number of EMRs. The resulting KB contains more than 26,000 distinct symptoms in Chinese, including 3,968 symptoms in traditional Chinese medicine (TCM), and 1,029 synonym pairs for symptoms. It also includes related concepts such as diseases and medicines, as well as the relations between symptoms and these entities. We link our KB to the Unified Medical Language System (UMLS) and analyze the differences between the symptoms in the two KBs. We have released the KB as Linked Open Data, together with a demo, at https://datahub.io/dataset/symptoms-in-chinese.
