Model-Based Clustering of Large Networks

We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. We achieve the added flexibility by introducing novel parameterizations of the model that offer varying degrees of parsimony, using exponential family models whose structure can be exploited both theoretically and algorithmically. The algorithms are variational generalized EM algorithms in which the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates rely on an efficient Monte Carlo network simulation idea. Finally, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
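
To make the variational EM idea concrete, the following is a minimal sketch for the simplest special case: an undirected network with binary edges and a Bernoulli block model. It is an illustration under stated assumptions, not the paper's implementation. In particular, the E-step here is a plain fixed-point sweep over the soft memberships rather than the MM-augmented E-step described above, and the function names (`simulate_sbm`, `variational_em`) and all parameter values are hypothetical.

```python
import numpy as np

def simulate_sbm(z, pi, rng):
    """Draw a symmetric 0/1 adjacency matrix (no self-loops) from a Bernoulli block model."""
    n = len(z)
    probs = pi[np.ix_(z, z)]                      # n x n edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(float)

def variational_em(y, K, n_iter=50, seed=0, eps=1e-10):
    """Alternate closed-form M-steps with fixed-point E-step sweeps.

    Assumption: a simultaneous fixed-point update stands in for the
    MM-based E-step of the paper; it is simpler but not the same algorithm.
    """
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    alpha = rng.dirichlet(np.ones(K), size=n)     # soft memberships, rows sum to 1
    off = 1.0 - np.eye(n)                         # mask that excludes self-pairs
    for _ in range(n_iter):
        # M-step: mixing proportions and block edge probabilities given alpha.
        gamma = alpha.mean(axis=0)
        num = alpha.T @ y @ alpha                 # expected edge counts per block pair
        den = alpha.T @ off @ alpha               # expected pair counts per block pair
        pi = np.clip(num / np.maximum(den, eps), eps, 1 - eps)
        # E-step: log alpha_ik is, up to a constant, log gamma_k plus the
        # expected log-likelihood of node i's edges if i belonged to block k.
        log_a = (np.log(gamma + eps)
                 + y @ alpha @ np.log(pi).T
                 + (off - y) @ alpha @ np.log1p(-pi).T)
        log_a -= log_a.max(axis=1, keepdims=True) # stabilize before exponentiating
        alpha = np.exp(log_a)
        alpha /= alpha.sum(axis=1, keepdims=True)
    return gamma, pi, alpha

# Usage on a small synthetic network with two planted blocks (hypothetical values).
rng = np.random.default_rng(1)
z_true = np.repeat([0, 1], 30)
pi_true = np.array([[0.60, 0.05],
                    [0.05, 0.50]])
y = simulate_sbm(z_true, pi_true, rng)
gamma_hat, pi_hat, alpha_hat = variational_em(y, K=2)
labels = alpha_hat.argmax(axis=1)                 # hard cluster assignment per node
```

The dense matrix products cost on the order of n² operations per iteration, which is fine for a toy network of this size but would not scale to the network sizes discussed above; the sketch is meant only to show the alternation between updating soft memberships and updating model parameters.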
