SyCo: A Probabilistic Machine Learning Method for Classifying Chief Complaints into Symptom and Syndrome Categories

OBJECTIVE Design, build, and evaluate a symptom-based probabilistic chief complaint classifier for the Realtime Outbreak and Disease Surveillance System (RODS).

BACKGROUND Many free-text classification techniques have been employed in biosurveillance, including keyword search, weighted keyword search, and naive Bayes. Both direct text-to-syndrome and text-to-symptom-to-syndrome classification approaches exist. The advantage of the latter approach is the ability to construct new syndrome classifiers from existing symptom classifiers. One approach to text-to-symptom-to-syndrome classification uses manually weighted keyword search and Boolean operations to build syndrome classifiers. The limitations of this approach are that it does not address uncertainty in the data and that the system must be parameterized manually. A text-to-symptom-to-syndrome approach that is probabilistic and uses machine learning addresses these limitations.

METHODS We constructed SyCo, a text-to-symptom-to-syndrome probabilistic chief complaint classifier. SyCo learns a naive Bayes model of the relationship between words and symptoms from a training set of labeled chief complaints. To perform a classification, SyCo first computes the posterior probability of each symptom using Bayes' rule. SyCo can optionally assume that a single word in the chief complaint indicates the presence of a symptom of interest; in that case it uses only the word with the likelihood ratio of greatest magnitude, along with the prior odds, to calculate the posterior probability. We added this option to counter overfitting to the training data: words not normally associated with a symptom in the training set lower the probability that a chief complaint encodes that symptom, causing misclassifications. Finally, SyCo uses the posterior probabilities from the first step to compute the posterior probability of a syndrome given a chief complaint. A syndrome is defined as any combination of symptom classes and Boolean operations. SyCo supports the operations AND, OR, and NOT. A board-certified infectious disease physician [JD] read 16,718 chief complaints and indicated the presence or absence of 17 symptoms for each chief complaint. We measured the performance of SyCo when classifying individual symptoms and three syndromes using leave-one-out cross-validation. We measured the area under the curve (AUC) of the resultant receiver operating characteristic (ROC) curves. We measured 90% confidence intervals using 100 iterations of non-parametric bootstrapping.

RESULTS The AUC for the 17 individual symptoms and 3 example syndromes, without and with the single-word assumption, ranged from 0.785 to 0.9918 and from 0.7442 to 0.9916, respectively. The single-word assumption improved performance significantly in 6 out of 20 cases and did not degrade performance significantly in any case.

CONCLUSION SyCo is a symptom-based probabilistic chief complaint classifier that has excellent discriminatory ability for classifying chief complaints into symptom categories and syndromes. SyCo is now available in RODS (Version 4.2).
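
The two classification steps described under METHODS can be illustrated with a minimal Python sketch. It is not the RODS implementation, and it rests on assumptions the abstract does not state: words are lowercase whitespace-separated tokens, only words present in the complaint contribute likelihood ratios, likelihood ratios are Laplace-smoothed, the "likelihood ratio of greatest magnitude" is read as the largest ratio, and symptoms are treated as independent when AND, OR, and NOT are combined. All function names and the toy training data are hypothetical.

```python
from collections import Counter
from math import prod


def train(examples, alpha=1.0):
    """Learn prior odds and per-word likelihood ratios for one symptom.

    examples: list of (chief_complaint_text, has_symptom) pairs.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    pos = [set(text.lower().split()) for text, label in examples if label]
    neg = [set(text.lower().split()) for text, label in examples if not label]
    pos_counts = Counter(w for doc in pos for w in doc)
    neg_counts = Counter(w for doc in neg for w in doc)
    likelihood_ratio = {}
    for word in set(pos_counts) | set(neg_counts):
        p_word_given_symptom = (pos_counts[word] + alpha) / (len(pos) + 2 * alpha)
        p_word_given_absent = (neg_counts[word] + alpha) / (len(neg) + 2 * alpha)
        likelihood_ratio[word] = p_word_given_symptom / p_word_given_absent
    prior_odds = (len(pos) + alpha) / (len(neg) + alpha)
    return prior_odds, likelihood_ratio


def symptom_posterior(text, model, single_word=False):
    """Posterior probability of the symptom via Bayes' rule in odds form."""
    prior_odds, likelihood_ratio = model
    ratios = [likelihood_ratio[w] for w in set(text.lower().split())
              if w in likelihood_ratio]
    if single_word and ratios:
        # Single-word option: keep only the largest likelihood ratio
        # (this sketch's reading of "greatest magnitude").
        ratios = [max(ratios)]
    odds = prior_odds * prod(ratios)
    return odds / (1.0 + odds)


# Boolean combination of symptom posteriors, assuming independent symptoms.
def p_not(p):
    return 1.0 - p


def p_and(*ps):
    return prod(ps)


def p_or(*ps):
    return 1.0 - prod(1.0 - p for p in ps)


# Hypothetical toy training set for a "respiratory symptom" label.
toy_examples = [
    ("cough fever", True),
    ("cough", True),
    ("ankle pain", False),
    ("chest pain", False),
]
toy_model = train(toy_examples)
p_all_words = symptom_posterior("cough pain", toy_model)
p_single_word = symptom_posterior("cough pain", toy_model, single_word=True)
```

In this toy case the word "pain" pulls the full naive Bayes posterior down, while the single-word option keeps only the evidence from "cough", which mirrors the motivation given for that option. A syndrome such as "symptom A AND NOT symptom B" would then be scored as p_and(pA, p_not(pB)) under the independence assumption.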