Syndromic surveillance requires the acquisition and analysis of data that may be “suggestive” of early epidemics in a community, long before there is any categorical evidence of unusual infection. These data are often heterogeneous and often quite noisy. The process of syndromic surveillance poses problems in data integration; in selection of appropriate reusable problem-solving methods, based on task features and on the nature of the data at hand; and in mapping integrated data to appropriate problem solvers. These are all tasks that have been studied carefully in the knowledge-based systems community for many years. We demonstrate how a software architecture that supports knowledge-based data integration and problem solving facilitates many aspects of the syndromic-surveillance task. In particular, we use reference ontologies for purposes of semantic integration and a parallelizable blackboard architecture for invocation of appropriate problem-solving methods and for control of reasoning. We demonstrate our results in the context of a prototype system known as the Biological Spatio-Temporal Outbreak Reasoning Module (BioSTORM), which offers an end-to-end solution to the problem of syndromic surveillance.

The New Trend: Syndromic Surveillance

In recent years, public-health surveillance has become a priority for national security and public health, driven by fears of possible bioterrorist attacks. Authorities argue that early detection of nascent outbreaks through surveillance of “pre-diagnostic” data is crucial to prevent massive illness and death (Pavlin 1999). The need for improved surveillance and the increasing availability of electronic data have resulted in a blossoming of surveillance-system development (Bravata et al. 2004). Most recently developed systems use electronically available data and statistical analytic methods in an attempt to detect disease outbreaks rapidly.
In general, the emphasis is on the interpretation of noisy, non-definitive data sources, such as diagnosis codes from emergency-room visits, reports of over-the-counter and prescription drug sales, reports of absenteeism, calls to medical advice personnel, and so on. For example, the Real-time Outbreak Detection System (RODS; Tsui et al. 2003) allows for automated transmission and analysis of administrative diagnostic codes and other data directly from hospital information systems at many emergency rooms in the greater Pittsburgh area. The Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE; Lombardo et al. 2003) monitors disease codes assigned for outpatient visits by military personnel and their dependents across the United States and throughout the world. More recently, the CDC began development of the BioSense system to monitor data from many sources, including DOD and VA facilities, laboratory systems, and over-the-counter pharmaceutical sales. By the summer of 2003, public health authorities had already deployed more than 100 different surveillance systems in the United States, all relying on electronically available data to detect disease outbreaks rapidly (Buehler et al. 2003).

In most situations, surveillance data that are available electronically are not collected for the express purpose of monitoring the public’s health. Recently deployed surveillance systems tend to rely on data collected for administrative and business purposes. For example, many systems follow healthcare-utilization records collected to enable billing, or pharmaceutical sales records collected for inventory and marketing purposes. Because these data sources are not collected with surveillance in mind, they often are biased in various ways. In addition, because public health agencies do not control the data collection, the data rarely conform to a standard format.
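The kind of statistical aberration detection these systems apply to noisy daily counts can be illustrated with a simple cumulative-sum (CUSUM) detector over a moving baseline. This is a generic sketch: the window length and thresholds below are illustrative defaults, not parameters of RODS, ESSENCE, BioSense, or any other deployed system.

```python
from statistics import mean, stdev

def cusum_alerts(counts, k=0.5, h=4.0, baseline_days=7):
    """Flag days whose cumulative standardized excess over a moving
    baseline exceeds a decision threshold h (in standard deviations)."""
    alerts, s = [], 0.0
    for day in range(baseline_days, len(counts)):
        window = counts[day - baseline_days:day]
        mu = mean(window)
        sigma = stdev(window) or 1.0   # guard against a flat baseline
        z = (counts[day] - mu) / sigma
        s = max(0.0, s + z - k)        # accumulate excess beyond k sigmas
        if s > h:
            alerts.append(day)
            s = 0.0                    # reset after signaling an alert
    return alerts

# A flat series raises no alerts; a sudden jump in syndrome counts does.
print(cusum_alerts([10] * 14 + [30, 35, 40]))
```

Real surveillance detectors must additionally adjust for day-of-week and seasonal effects in these administrative data streams, which this sketch ignores.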
Different data sources can represent the same concepts differently, and different data sources can also represent different concepts in a superficially similar manner. When incorporating data sources into a surveillance system, these differences in structure and concept representation must be reconciled. Semantic reconciliation is especially important so that analyses across data sources can integrate conceptually diverse data and can reason about those data in a consistent manner.

Knowledge-Based Syndromic Surveillance

To meet the complex operational and research needs of surveillance applications, we have developed a prototype system known as the Biological Spatio-Temporal Outbreak Reasoning Module (BioSTORM; Buckeridge et al. 2003). BioSTORM is a computational framework that brings together a variety of data sources and analytic problem solvers with the goal of meeting the performance demands of emerging disease-surveillance systems. The system addresses the following goals: (1) to acquire and curate data from diagnostic and pre-diagnostic sources; (2) to provide a knowledge-based infrastructure to integrate and experiment with alternative data sources and problem solvers; and (3) to support development and evaluation of problem solvers for temporal and spatial analysis.
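Such semantic reconciliation can be sketched as per-source mappings into a shared concept vocabulary. The field names, codes, and category labels below are hypothetical illustrations, not drawn from any actual surveillance feed:

```python
# Two sources encode the same concept ("respiratory complaint") differently:
# one as ICD-9 diagnosis codes, the other as pharmacy product categories.
ER_SYNDROME_MAP = {"786.2": "respiratory", "787.01": "gastrointestinal"}
PHARMACY_MAP = {"cough_cold": "respiratory", "antidiarrheal": "gastrointestinal"}

def reconcile(record, source):
    """Translate a source-specific record into the shared vocabulary."""
    if source == "er_visits":
        return {"concept": ER_SYNDROME_MAP.get(record["icd9"], "other"),
                "date": record["visit_date"]}
    if source == "otc_sales":
        return {"concept": PHARMACY_MAP.get(record["category"], "other"),
                "date": record["sale_date"]}
    raise ValueError(f"unknown source: {source}")

print(reconcile({"icd9": "786.2", "visit_date": "2003-06-01"}, "er_visits"))
```

After reconciliation, a downstream analysis can aggregate “respiratory” events across both feeds without knowing how each source encoded them.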
As shown in Figure 1, the BioSTORM system has four main components, each described in the remainder of this section: (1) a data-source ontology for describing the features of specific data sources and data streams to be used for analysis; (2) a library of statistical and knowledge-based problem solvers for analyzing biosurveillance data; (3) an intelligent mediation component that includes (a) a data broker to integrate multiple, related data sources that have been described in the data-sources ontology and (b) a mapping interpreter to connect the integrated data from the data broker to the problem solvers that can best analyze those data; and (4) a control structure, known as RASTA, that deploys the problem solvers on incoming streams of data.

Figure 1. Overview of deployed BioSTORM components showing data being fed through the Data Broker and Mapping Interpreter to a set of problem-solving methods. The RASTA deployment controller orchestrates the deployment of problem-solving methods (PSMs) and the flow of data to those PSMs via the Data Broker and the Mapping Interpreter. The Data Source and Mapping Ontologies are used by the broker and mapping interpreter to construct semantically uniform streams of data for the deployed PSMs. The Method Ontology is used by RASTA to configure sets of PSMs into analytic strategies to perform analysis on those data streams.

A Data-Source Ontology for Describing and Contextualizing Data Streams

Public-health surveillance data are diverse and usually distributed in various databases and files with little common semantic or syntactic structure. Thus, these data can be difficult to represent in a way that enables their consistent analysis by reusable analytic methods. We have developed a data-sources ontology that provides a means for describing extremely diverse data in a coherent manner and that facilitates reasoning and processing of those data (Pincus and Musen 2003).
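The flow among the components in Figure 1 (broker integrates sources, mapping interpreter routes records to problem solvers, controller drives the loop) can be sketched minimally as follows. The class and method names are illustrative stand-ins, not BioSTORM's actual interfaces:

```python
class DataBroker:
    """Merge several described sources into one tagged record stream."""
    def __init__(self, sources):
        self.sources = sources          # source name -> iterable of records
    def stream(self):
        for name, records in self.sources.items():
            for rec in records:
                yield {"source": name, **rec}

class MappingInterpreter:
    """Route each integrated record to the PSMs suited to its concept."""
    def __init__(self, routes):
        self.routes = routes            # concept -> list of PSM callables
    def dispatch(self, record):
        return [psm(record) for psm in self.routes.get(record["concept"], [])]

def run_surveillance(broker, interpreter):
    """Controller loop: pull integrated data, fan each record out to PSMs."""
    return [result
            for rec in broker.stream()
            for result in interpreter.dispatch(rec)]
```

In BioSTORM the routing is driven by ontologies rather than a hard-coded dictionary, but the division of labor is the same: integration, mapping, and control are kept as separate, reusable pieces.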
Our ontology provides a domain-independent semantic structure for raw data to assist in the integration of data from disparate sources. More precisely, the data-sources ontology provides an approach to data integration that combines the semantic rigor of creating a global ontology with the flexibility and level of detail that comes from devising customized, local ontologies specifically for each data source of interest. The data-sources ontology aims to make data self-descriptive by associating a structured context with each potential data source. A developer describes the context of data from a particular data source by filling in a template with relevant details about that data source. The data-sources ontology provides a predefined taxonomy of data attributes to describe this context. To describe individual data elements, developers use terms that we have adopted from the Logical Observation Identifiers Names and Codes (LOINC). The LOINC approach, which clinical pathologists use to contextualize results reported by clinical laboratories, describes a piece of data along five major semantic axes. We have generalized the LOINC axes from their specific role in reporting clinical laboratory results to a generic set of descriptors for many different types of data. This systematic, template-directed process allows developers to create a customized local model of each data source that shares a common and consistent structure, space of attributes, and set of possible attribute values with all other similarly created models. The BioSTORM system processes the data-sources ontology to access relevant context information about incoming data streams and to interpret and analyze those data appropriately. We have used the data-sources ontology successfully to develop descriptions for a number of data sources, including San Francisco Emergency 911 dispatch data and patient data from the Palo Alto VA medical center, as well as data related to reportable diseases, in collaboration with the CDC.
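The template-directed description process can be illustrated with a simple record type whose axes follow LOINC's major semantic axes (component, property, time, system, scale). The class name and field values are invented for this sketch and are not BioSTORM's actual ontology terms:

```python
from dataclasses import dataclass

@dataclass
class DataElementDescription:
    """Describe a data element along generalized LOINC-style axes."""
    component: str   # what is observed, e.g. "respiratory complaints"
    property: str    # kind of quantity, e.g. "count"
    timing: str      # time aspect, e.g. "24-hour aggregate"
    system: str      # where it is observed, e.g. "emergency department"
    scale: str       # scale of measurement, e.g. "quantitative"

# A hypothetical description of an emergency-department count stream.
er_visits = DataElementDescription(
    component="respiratory complaints",
    property="count",
    timing="24-hour aggregate",
    system="emergency department",
    scale="quantitative",
)
```

Because every source is described with the same axes, a problem solver can inspect, say, the `timing` and `scale` fields to decide whether a given stream is suitable input, without source-specific code.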
The data-sources ontology was able to capture individual-level primitive data (e.g., signs and symptoms, laboratory tests) as well as observable population-level data (e.g., aggregated syndrome counts, school absenteeism). The ontology offers descriptions of generic data sources (e.g., 911 dispatch data), with instances of the generic descriptions that describe the specific data fields of particular data sources.
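A minimal sketch of this generic-description-plus-instance pattern, with invented class and field names rather than BioSTORM's actual ontology structures:

```python
class DataSource:
    """Generic data-source description: a name plus typed data fields."""
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields            # field name -> semantic type

# An instance of the generic description for a specific feed; the field
# names and semantic types are hypothetical.
dispatch_911 = DataSource(
    name="sf_911_dispatch",
    fields={"call_time": "timestamp",
            "dispatch_code": "coded_complaint",
            "location": "geocoded_address"},
)
```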
References

[1] Nicholas Carriero et al. How to write parallel programs: a guide to the perplexed. CSUR, 1989.
[2] David M. Hartley et al. Syndromic Surveillance and Bioterrorism-related Epidemics. Emerging Infectious Diseases, 2003.
[3] Steffen Staab et al. International Handbooks on Information Systems. 2013.
[4] Mark A. Musen et al. Ontologies in Support of Problem Solving. Handbook on Ontologies, 2004.
[5] Joseph S. Lombardo et al. A systems overview of the Electronic Surveillance System for the Early Notification of Community-Based Epidemics (ESSENCE II). Journal of Urban Health, 2003.
[6] D. Buckeridge et al. Systematic Review: Surveillance Systems for Early Detection of Bioterrorism-Related Diseases. Annals of Internal Medicine, 2004.
[7] Mark A. Musen et al. Contextualizing Heterogeneous Data for Integration and Inference. AMIA, 2003.
[8] Michael M. Wagner et al. Technical Description of RODS: A Real-time Public Health Surveillance System. Journal of the American Medical Informatics Association, 2003.
[9] J. Pavlin et al. Epidemiology of bioterrorism. Emerging Infectious Diseases, 1999.
[10] David L. Buckeridge et al. A Knowledge-Based Framework for Deploying Surveillance Problem Solvers. IKE, 2004.
[11] Samson W. Tu et al. Mapping domains to methods in support of reuse. International Journal of Human-Computer Studies, 1994.
[12] David L. Buckeridge et al. An Analytic Framework for Space-Time Aberrancy Detection in Public Health Surveillance Data. AMIA, 2003.