Generalized k-means in GLMs with applications to the outbreak of COVID-19 in the United States

Generalized k-means can be combined with any similarity or dissimilarity measure for clustering. Using the well-known likelihood ratio or F-statistic as the dissimilarity measure, a generalized k-means method is proposed to group generalized linear models (GLMs) for exponential family distributions. Given the number of clusters k, the proposed method is built on the uniformly most powerful unbiased (UMPU) test statistic for comparing GLMs. If k is unknown, the proposed method can be combined with the generalized information criterion (GIC) to select the best k automatically. Both AIC and BIC are investigated as special cases of GIC. Theoretical and simulation results show that the number of clusters can be correctly identified by BIC but not by AIC. The proposed method is applied to state-level daily COVID-19 data in the United States, where it identifies 6 clusters. A further study shows that the models in different clusters differ significantly from each other, which supports the 6-cluster result.
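The idea of clustering GLMs by a likelihood-based dissimilarity and then choosing k by BIC can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the Poisson log-link model, the simulated data, the function names `fit_poisson`, `neg2_loglik` and `glm_kmeans`, the random restarts, and the penalty `k * p * log(N)`, which may differ from the GIC form used in the paper. Each unit is assigned to the cluster whose pooled fit gives it the smallest -2 log-likelihood. Because a unit's own maximized likelihood does not depend on the cluster, this is equivalent to minimizing a likelihood-ratio statistic.

```python
import numpy as np

def fit_poisson(X, y, iters=50):
    """Log-link Poisson GLM fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 1e-8)      # assumes column 0 is the intercept
    for _ in range(iters):
        eta = np.clip(X @ beta, -30, 30)
        mu = np.exp(eta)
        z = eta + (y - mu) / mu            # working response
        WX = X * mu[:, None]               # IRLS weights equal mu for Poisson
        new = np.linalg.solve(X.T @ WX, WX.T @ z)
        done = np.max(np.abs(new - beta)) < 1e-10
        beta = new
        if done:
            break
    return beta

def neg2_loglik(X, y, beta):
    """-2 * Poisson log-likelihood, omitting log(y!) (constant in beta)."""
    eta = X @ beta
    return -2.0 * np.sum(y * eta - np.exp(eta))

def glm_kmeans(units, k, rng, n_iter=25, restarts=10):
    """units: list of (X, y). Returns (labels, total -2 loglik) of the best restart."""
    n = len(units)
    best_labels, best_total = None, np.inf
    for _ in range(restarts):
        labels = rng.permutation(n) % k    # random start, every cluster non-empty
        betas = [None] * k
        for _ in range(n_iter):
            for c in range(k):
                idx = np.flatnonzero(labels == c)
                if idx.size:               # keep the old fit if a cluster empties
                    Xc = np.vstack([units[i][0] for i in idx])
                    yc = np.concatenate([units[i][1] for i in idx])
                    betas[c] = fit_poisson(Xc, yc)
            # cost[i, c]: -2 loglik of unit i under the pooled fit of cluster c
            cost = np.array([[neg2_loglik(X, y, b) for b in betas]
                             for X, y in units])
            new_labels = cost.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        total = cost[np.arange(n), labels].sum()  # under the last fitted betas
        if total < best_total:
            best_labels, best_total = labels, total
    return best_labels, best_total

# Simulated example: 12 units, two true groups with different coefficients.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 6)
coefs = np.array([[1.0, 2.0], [3.0, -1.0]])
units = []
for g in truth:
    x = rng.uniform(0, 1, 40)
    X = np.column_stack([np.ones_like(x), x])
    units.append((X, rng.poisson(np.exp(X @ coefs[g]))))

N, p = sum(len(y) for _, y in units), 2
bic = {}
for k in (1, 2, 3):
    labels_k, dev = glm_kmeans(units, k, rng)
    bic[k] = dev + k * p * np.log(N)       # assumed BIC-type penalty
print(bic)
```

Replacing `np.log(N)` with `2` gives an AIC-type penalty under the same assumptions, which makes it easy to compare how the two criteria behave as k grows.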
