Any Language Early Detection of Epidemic Diseases from Web News Streams

In this paper, we introduce a multilingual epidemiological news surveillance system. Its main contribution is its ability to extract epidemic events in any language, thereby succeeding where state-of-the-art surveillance systems usually fail: reactivity. Most systems focus on a selected list of languages deemed important. However, evidence shows that events are first described in the local language and translated into other languages later, if and only if they contain important information. Hence, while systems handling only a sample of human languages may succeed at extracting epidemic events, they do so only after someone else has recognized the importance of the news and decided to translate it. For events first described in other languages, such automated systems can only detect what humans have already detected, which makes them essentially irrelevant for early detection. To overcome this weakness, we designed a system that can detect epidemiological events in any language without requiring any translation, whether automated or human-written. The solution presented in this paper relies on properties that may be called language universals. First, we observe and exploit properties of the news genre that remain unchanged whatever the language of writing. Second, we handle language variations, such as declensions, by processing text at the character level rather than at the word level. This also makes it possible to handle various writing systems in a similar fashion. We present experiments with 5 languages, representative of different language families and writing systems: English, Chinese, Greek, Polish and Russian. Our system, DAnIEL, achieves an average F-measure of around 85%, slightly below top-performing systems for the languages that such systems are able to handle. However, its performance is superior for morphologically rich languages. It also covers the languages that other systems cannot handle at all: the richest state-of-the-art system handles around 10 languages, while there exist about 6,000 languages in the world, 300 of which are spoken by more than one million people. The DAnIEL system is able to process each of them.
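To illustrate why character-level processing tolerates declensions, here is a minimal Python sketch. It is not the DAnIEL algorithm, which the abstract does not detail; it only shows the general idea that an inflected form still shares a long character substring with its base form. The function names `longest_common_substring` and `char_level_match`, and the 0.75 threshold, are assumptions made for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears in both a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def char_level_match(term: str, text: str, min_ratio: float = 0.75) -> bool:
    """Hypothetical matcher: True if a large share of `term` occurs in `text`.

    `min_ratio` is an illustrative threshold, not a value from the paper.
    """
    term = term.casefold()
    text = text.casefold()
    if not term:
        return False
    shared = longest_common_substring(term, text)
    return len(shared) / len(term) >= min_ratio


# Polish: the base form "grypa" (flu) appears declined as "grypy".
print(char_level_match("grypa", "Epidemia grypy w Polsce"))
# Russian: "грипп" appears declined as "гриппа".
print(char_level_match("грипп", "Эпидемия гриппа в России"))
# Chinese: no word boundaries are needed at the character level.
print(char_level_match("流感", "流感疫情持续"))
```

A word-level exact match would miss the Polish and Russian examples, and would require a segmenter for Chinese; comparing character sequences avoids both language-specific steps, at the cost of needing a threshold to limit spurious matches.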