An Automated Approach for Finding Spatio-Temporal Patterns of Seasonal Influenza in the United States: Algorithm Validation Study

Background Agencies such as the Centers for Disease Control and Prevention (CDC) currently release influenza-like illness incidence data, along with descriptive summaries of simple spatio-temporal patterns and trends. However, public health researchers, government agencies, as well as the general public, are often interested in deeper patterns and insights into how the disease is spreading, with additional context. Analysis by domain experts is needed for deriving such insights from incidence data. Objective Our goal was to develop an automated approach for finding interesting spatio-temporal patterns in the spread of a disease over a large region, such as regions which have specific characteristics (eg, high incidence in a particular week, those which showed a sudden change in incidence) or regions which have significantly different incidence compared to earlier seasons. Methods We developed techniques from the area of transactional data mining for characterizing and finding interesting spatio-temporal patterns in disease spread in an automated manner. A key part of our approach involved using the principle of minimum description length for representing a given target set in terms of combinations of attributes (referred to as clauses); we considered both positive and negative clauses, relaxed descriptions which approximately represent the set, and used integer programming to find such descriptions. Finally, we designed an automated approach, which examines a large space of sets corresponding to different spatio-temporal patterns, and ranks them based on the ratio of their size to their description length (referred to as the compression ratio). Results We applied our methods using minimum description length to find spatio-temporal patterns in the spread of seasonal influenza in the United States using state level influenza-like illness activity indicator data from the CDC. We observed that the compression ratios were over 2.5 for 50% of the chosen sets, when approximate descriptions and negative clauses were allowed. Sets with high compression ratios (eg, over 2.5) corresponded to interesting patterns in the spatio-temporal dynamics of influenza-like illness. Our approach also outperformed description by solution in terms of the compression ratio. Conclusions Our approach, which is an unsupervised machine learning method, can provide new insights into patterns and trends in the disease spread in an automated manner. Our results show that the description complexity is an effective approach for characterizing sets of interest, which can be easily extended to other diseases and regions beyond influenza in the US. Our approach can also be easily adapted for automated generation of narratives.

[1]  Jieping Ye,et al.  Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction , 2015, KDD.

[2]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[3]  Pauli Miettinen,et al.  Model order selection for boolean matrix factorization , 2011, KDD.

[4]  Ian H. Witten,et al.  Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques , 2016 .

[5]  Mark A. Pitt,et al.  Advances in Minimum Description Length: Theory and Applications , 2005 .

[6]  Alessandro Vespignani,et al.  Real-time numerical forecast of global epidemic spreading: case study of 2009 A/H1N1pdm , 2012, BMC Medicine.

[7]  Vipin Kumar,et al.  Summarization - compressing data into an informative representation , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[8]  Jorma Rissanen,et al.  Minimum Description Length Principle , 2010, Encyclopedia of Machine Learning.

[9]  Takeaki Uno,et al.  Frequent Pattern Mining , 2016, Encyclopedia of Algorithms.

[10]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[11]  Madhav V. Marathe,et al.  Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions , 2014, SDM.

[12]  Yang Xiang,et al.  Summarizing transactional databases with overlapped hyperrectangles , 2011, Data Mining and Knowledge Discovery.

[13]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[14]  Jiawei Han,et al.  Frequent pattern mining: current status and future directions , 2007, Data Mining and Knowledge Discovery.

[15]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[16]  Umeshwar Dayal,et al.  FreeSpan: frequent pattern-projected sequential pattern mining , 2000, KDD '00.

[17]  Jilles Vreeken,et al.  Krimp: mining itemsets that compress , 2011, Data Mining and Knowledge Discovery.

[18]  Jae-Gil Lee,et al.  Trajectory clustering: a partition-and-group framework , 2007, SIGMOD '07.

[19]  Yue Xu,et al.  Automatic Pattern-Taxonomy Extraction for Web Mining , 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[20]  Ronald Rosenfeld,et al.  Flexible Modeling of Epidemics with an Empirical Bayes Framework , 2014, PLoS Comput. Biol..