Multilingual event extraction for epidemic detection

OBJECTIVE This paper presents a multilingual news surveillance system applied to tele-epidemiology. Multilingual approaches have been shown to improve the timeliness of epidemic event detection across the globe, because they remove the wait for local news to be translated into major languages. We present a system that extracts epidemic events in potentially any language, provided a Wikipedia seed for common disease names exists.

METHODS The Daniel system relies on properties that are common to news writing (the journalistic genre), the most useful being repetition and saliency. Wikipedia is used to screen common disease names, which are then matched against repeated character strings. Language variations, such as declensions, are handled by processing text at the character level rather than at the word level. This also makes it possible to handle various writing systems in a similar fashion.

MATERIAL As no multilingual ground truth existed to evaluate the Daniel system, we built a multilingual corpus from the Web and collected annotations from native speakers of Chinese, English, Greek, Polish and Russian who had no connection to, or interest in, the Daniel system. This data set is freely available online and can be used to evaluate other event extraction systems.

RESULTS Experiments for 5 of the 17 languages tested are detailed in this paper: Chinese, English, Greek, Polish and Russian. The Daniel system achieves an average F-measure of 82% across these 5 languages. It reaches 87% on BEcorpus, the state-of-the-art corpus in English, slightly below top-performing systems, which are tailored with numerous language-specific resources. The consistent performance of Daniel across multiple languages is an important contribution to the reactivity and coverage of epidemiological event detection systems.

CONCLUSIONS Most event extraction systems rely on extensive language-specific resources. While their sophistication yields excellent results (over 90% precision and recall), it restricts their coverage in terms of languages and geographic areas. In contrast, to detect epidemic events in any language, the Daniel system only requires a list of a few hundred disease names and locations, which can be acquired automatically. The system performs consistently well across languages, with precision and recall around 82% on average, according to this paper's evaluation. Daniel's character-based approach is especially interesting for morphologically rich and low-resourced languages. Because it needs no heavy resources and relies on state-of-the-art string matching algorithms, Daniel can process thousands of documents per minute on a simple laptop. In the context of epidemic surveillance, reactivity and geographic coverage are of primary importance, since no one knows where the next event will strike, and therefore in what vernacular language it will first be reported. By being able to process any language, the Daniel system offers unique coverage for poorly resourced languages, and can complement state-of-the-art techniques for major languages.
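
The character-level matching described in METHODS can be illustrated with a minimal Python sketch. It is not the Daniel implementation: it swaps in a plain dynamic-programming longest common substring for whatever string matching algorithm the system actually uses, and the names `longest_common_substring`, `detect_diseases`, the head/body split and the `min_ratio` threshold of 0.8 are all hypothetical choices made for illustration. The idea shown is that a disease name is accepted when a large enough part of it appears as a character string in a salient position (here, the headline) and is repeated in the body, so that inflected forms still match without any word-level analysis.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous character string shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def detect_diseases(head, body, disease_names, min_ratio=0.8):
    """Map each matched disease name to the character string that matched.

    Assumptions (not from the paper): `head` stands for the salient part of
    the article, `body` for the rest, and `min_ratio` is an illustrative
    threshold on the share of the name that must be covered.
    Names are expected to be non-empty.
    """
    head_l, body_l = head.lower(), body.lower()
    found = {}
    for name in disease_names:
        stem = longest_common_substring(name.lower(), head_l)
        # Saliency: the string occurs in the head. Repetition: also in the body.
        if len(stem) / len(name) >= min_ratio and stem in body_l:
            found[name] = stem
    return found


if __name__ == "__main__":
    # Polish example: "grypa" (flu) appears only in declined forms.
    head = "Epidemia grypy w Polsce"
    body = "Lekarze ostrzegają: na grypę choruje coraz więcej osób."
    print(detect_diseases(head, body, ["grypa", "cholera", "odra"]))
```

In this example the seed name "grypa" never occurs verbatim in the text, yet the shared string "gryp" covers enough of the name to match both "grypy" and "grypę", which is the kind of declension handling the character-level approach is meant to provide.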
