EpiK: A Knowledge Base for Epidemiological Modeling and Analytics of Infectious Diseases

Computational epidemiology seeks to develop computational methods to study the distribution and determinants of health-related states or events (including disease), and to apply this study to the control of diseases and other health problems. Recent advances in computing and data sciences have led to the development of innovative modeling environments to support this important goal. The datasets used to drive the dynamic models, as well as the data produced by these models, present unique challenges owing to their size, heterogeneity and diversity. These datasets form the basis of effective and easy-to-use decision support and analytical environments. As a result, it is important to develop scalable data management systems to store, manage and integrate these datasets. In this paper, we develop EpiK, a knowledge base that facilitates the development of decision support and analytical environments to support epidemic science. An important goal is to develop a framework that links the input and output datasets to facilitate the spatio-temporal and social reasoning that is critical in planning and intervention analysis before and during an epidemic. The data management framework links modeling workflow data and its metadata using a controlled vocabulary. The metadata captures information about storage, the mapping between the linked model and the physical layout, and relationships to support services. EpiK is designed to support agent-based modeling and analytics frameworks; aggregate models can be seen as special cases and are thus supported. We use semantic web technologies to create a representation of the datasets that encapsulates both the location and the schema heterogeneity. The choice of RDF as a representation language is motivated by the diversity and growth of the datasets that need to be integrated.
We also develop a query bank whose queries capture a broad range of questions that can be posed and answered during a typical case study pertaining to disease outbreaks. The queries are written in the SPARQL Protocol and RDF Query Language (SPARQL) and run over EpiK. EpiK can hide schema and location heterogeneity while efficiently supporting queries that span the computational epidemiology modeling pipeline, from model construction to simulation output. We show that the performance of benchmark queries varies significantly with the choice of hardware underlying the database and the resource description framework (RDF) engine.
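
To make the idea of a single query spanning input and output data concrete, the following is a minimal sketch in plain Python, not EpiK's implementation. It assumes a tiny in-memory list of subject-predicate-object triples and a hand-written pattern matcher standing in for an RDF store and a SPARQL engine. Every name in it (the `ex:` vocabulary, `TRIPLES`, `match`, `query`, `infected_in_county`) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical triples. The "ex:" vocabulary is illustrative only and is
# not the controlled vocabulary used by EpiK.
TRIPLES: List[Triple] = [
    # Model input: where each synthetic person lives.
    ("ex:person1", "ex:livesIn", "ex:countyA"),
    ("ex:person2", "ex:livesIn", "ex:countyB"),
    ("ex:person3", "ex:livesIn", "ex:countyA"),
    # Simulation output: the day on which each person became infected.
    ("ex:person1", "ex:infectedOnDay", "3"),
    ("ex:person2", "ex:infectedOnDay", "5"),
    ("ex:person3", "ex:infectedOnDay", "7"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check consistency with an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant term that does not match the triple.
            return None
    return extended


def query(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Join triple patterns by nested-loop matching over all triples."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


def infected_in_county(county: str) -> List[Tuple[str, int]]:
    """Who living in `county` was infected, and on which day?"""
    # Roughly the SPARQL basic graph pattern:
    #   ?p ex:livesIn <county> . ?p ex:infectedOnDay ?day .
    rows = query(
        [("?p", "ex:livesIn", county), ("?p", "ex:infectedOnDay", "?day")],
        TRIPLES,
    )
    return sorted((row["?p"], int(row["?day"])) for row in rows)


if __name__ == "__main__":
    for person, day in infected_in_county("ex:countyA"):
        print(person, day)
```

Because both the population data and the simulation results are expressed as triples over shared identifiers, the query does not need to know where or in what schema each dataset is physically stored. A real deployment would use an RDF engine with indexes and a query optimizer rather than the nested-loop join shown here.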
