Consistencies and inconsistencies between model selection and link prediction in networks.

A principled approach to understanding network structures is to formulate generative models. Given a collection of models, however, an outstanding key task is to determine which one provides the most accurate description of the network at hand, discounting statistical fluctuations. This problem can be approached using two principled criteria that at first may seem equivalent: selecting the most plausible model in terms of its posterior probability, or selecting the model with the highest predictive performance in terms of identifying missing links. Here we show that while these two approaches yield consistent results in most cases, there are also notable instances where they do not, that is, where the most plausible model is not the most predictive. We show that in such cases the improvement in predictive performance can in fact lead to overfitting, in both artificial and empirical settings. Furthermore, we show that, in general, the predictive performance is higher when we average over collections of models that are individually less plausible than when we consider only the single most plausible model.
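The difference between the two criteria can be illustrated with a deliberately minimal sketch. It is not the method used in the paper: the "models" here are assumed to be a small grid of edge densities for a homogeneous random graph with a uniform prior, and the names `models`, `n_pairs` and `n_edges` are hypothetical. The sketch contrasts the missing-link probability given by the single most plausible model with the one obtained by averaging over all models, weighted by their posterior probability.

```python
import math

def log_likelihood(p, n_pairs, n_edges):
    # Log-probability of observing n_edges among n_pairs node pairs
    # when every pair is connected independently with probability p.
    return n_edges * math.log(p) + (n_pairs - n_edges) * math.log(1.0 - p)

def posterior_weights(models, n_pairs, n_edges):
    # Posterior probability of each candidate model under a uniform prior
    # (an assumption of this sketch), normalized in a numerically safe way.
    logs = [log_likelihood(p, n_pairs, n_edges) for p in models]
    top = max(logs)
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical toy data: 10 observed edges among 45 node pairs.
models = [0.1, 0.2, 0.3, 0.4]   # candidate edge densities (assumed grid)
n_pairs, n_edges = 45, 10

weights = posterior_weights(models, n_pairs, n_edges)

# Criterion 1: keep only the single most plausible model.
map_index = max(range(len(models)), key=weights.__getitem__)
p_map = models[map_index]

# Criterion 2: average the missing-link probability over all models,
# each weighted by its posterior probability.
p_avg = sum(w * p for w, p in zip(weights, models))

print("most plausible model predicts:", p_map)
print("posterior-averaged prediction:", p_avg)
```

In this toy setting the two predictions differ because the less plausible models still carry non-negligible posterior weight. Whether averaging actually improves the identification of missing links has to be assessed on held-out edges, which the sketch does not attempt.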
