Background: In recent years, work on using information extraction technologies to track disease outbreaks from online news texts has grown, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking.

Objective: This study seeks to create a "gold standard" data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that providing an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area.

Methods: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event, in the context of our annotation scheme, consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain-salient concepts (eg, international travel, species, and food contamination).

Results: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration.

Conclusion: We present an annotation scheme and corpus that can be used to evaluate disease outbreak event extraction algorithms. The annotation scheme and corpus were designed with both the particular evaluation requirements of the BioCaster system and the wider need for further evaluation resources in this growing research area in mind.
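As a rough illustration of the minimal event definition given in the Methods, the following Python sketch models an event record and checks whether it carries both disease name and geographical information. It is not the annotation scheme's actual format: the class name `OutbreakEvent`, its field names, the helper `mean_events_per_document`, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OutbreakEvent:
    """Hypothetical event record; field names are assumptions, not the scheme's own."""
    disease: Optional[str] = None
    country: Optional[str] = None
    province: Optional[str] = None
    # Optional domain-salient concepts, e.g. international travel or species.
    attributes: Dict[str, str] = field(default_factory=dict)

    def is_minimal(self) -> bool:
        """True when a disease name and some geographical information are both present."""
        return bool(self.disease) and bool(self.country or self.province)


def mean_events_per_document(events_per_doc: List[int]) -> float:
    """Average number of events over a list of per-document event counts."""
    return sum(events_per_doc) / len(events_per_doc)


# Made-up example values, not taken from the corpus.
complete = OutbreakEvent(disease="influenza", country="Japan",
                         attributes={"species": "human"})
missing_location = OutbreakEvent(disease="influenza")

print(complete.is_minimal())          # disease and country are both set
print(missing_location.is_minimal())  # no geographical information

# The reported mean follows directly from the corpus totals: 394 events in 200 documents.
print(round(394 / 200, 2))
```

Keeping the optional concepts in a free-form mapping is only one possible design; a fixed set of typed fields would serve equally well if the set of concepts is known in advance.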