Multilingual chief complaint classification for syndromic surveillance: An experiment with Chinese chief complaints

Abstract

Purpose: Syndromic surveillance aims at the early detection of disease outbreaks. An important data source for syndromic surveillance is free-text chief complaints (CCs), which may be recorded in different languages. For automated syndromic surveillance, CCs must be classified into predefined syndromic categories to facilitate subsequent data aggregation and analysis. Although syndromic surveillance is largely an international effort, existing CC classification systems do not provide adequate support for processing CCs recorded in languages other than English. This paper reports a multilingual CC classification effort, focusing on CCs recorded in Chinese.

Methods: We propose a novel Chinese CC classification system that leverages a Chinese-English translation module and an existing English CC classification approach. A set of 470 Chinese key phrases was extracted from about one million Chinese CC records using statistical methods. Based on the extracted key phrases, the system translates Chinese text into English and classifies the translated CCs into syndromic categories using an existing English CC classification system.

Results: In a computational experiment using real-world CC records, our approach performs significantly better than alternative approaches based on a bilingual dictionary and on a general-purpose machine translation system. The comparison covers positive predictive value (PPV, or precision), sensitivity (recall), specificity, and F measure (the harmonic mean of PPV and sensitivity).

Conclusions: Our design provides satisfactory performance in classifying Chinese CCs into syndromic categories for public health surveillance. The overall design of our system also points to a potentially fruitful direction for multilingual CC systems that need to handle languages beyond English and Chinese.
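The four evaluation measures named in the Results can be illustrated with a minimal Python sketch. It assumes that, for a single syndromic category, each CC record is counted as a true positive, false positive, true negative, or false negative against a reference standard. The names `Confusion` and `evaluate` and the counts in the usage note are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for one syndromic category (hypothetical container)."""
    tp: int  # records correctly assigned to the category
    fp: int  # records wrongly assigned to the category
    tn: int  # records correctly left out of the category
    fn: int  # records that belong to the category but were missed


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a denominator is zero.
    return numerator / denominator if denominator else 0.0


def evaluate(c: Confusion) -> dict:
    """Compute PPV, sensitivity, specificity, and F measure."""
    ppv = _ratio(c.tp, c.tp + c.fp)          # precision
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # recall
    specificity = _ratio(c.tn, c.tn + c.fp)
    # F measure: harmonic mean of PPV and sensitivity.
    f_measure = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }
```

For example, `evaluate(Confusion(tp=8, fp=2, tn=85, fn=5))` uses made-up counts in which 8 of 10 flagged records are correct, giving a PPV of 0.8. Because the F measure is a harmonic mean, it stays low whenever either PPV or sensitivity is low, so a system cannot score well by favoring one at the expense of the other.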
