Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters

A large body of work has been devoted to defining and identifying clusters or communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins with the premise that a community or a cluster should be thought of as a set of nodes that has more and/or better connections between its members than to the remainder of the network. In this paper, we explore, from a novel perspective, several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. Rather than defining a procedure to extract sets of nodes from a graph and then attempting to interpret these sets as "real" communities, we employ approximation algorithms for the graph-partitioning problem to characterize, as a function of size, the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the "best" possible community, according to the conductance measure, over a wide range of size scales. We study over one hundred large real-world networks, ranging from traditional and online social networks to technological and information networks and web graphs, and ranging in size from thousands up to tens of millions of nodes. Our results suggest a significantly more refined picture of community structure in large networks than has been appreciated previously. Our observations agree with previous work on small networks, but we show that large networks have a very different structure.
In particular, we observe tight communities that are barely connected to the rest of the network at very small size scales (up to ≈ 100 nodes); and communities of size scale beyond ≈ 100 nodes gradually "blend into" the expander-like core of the network and thus become less "community-like," with a roughly inverse relationship between community size and optimal community quality. This observation agrees well with the so-called Dunbar number, which gives a limit to the size of a well-functioning community. However, this behavior is not explained, even at a qualitative level, by any of the commonly used network-generation models. Moreover, it is exactly the opposite of what one would expect based on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social networks that have served as test beds of community-detection algorithms. The relatively gradual increase of the network community profile plot as a function of increasing community size depends in a subtle manner on the way in which local clustering information is propagated from smaller to larger size scales in the network. We have found that a generative graph model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network data sets.
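To make the network community profile concrete, the following is a minimal sketch rather than the method used in the paper. It assumes the standard definition of conductance for a node set S, namely the number of edges leaving S divided by the smaller of the total degree of S and the total degree of its complement, and it takes the profile value at size k to be the minimum conductance over all node sets of size k. The paper relies on approximation algorithms for graph partitioning; the brute-force enumeration here is only feasible on toy graphs, and the function names and the example graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conductance(adj, nodes):
    """Conductance of a node set: cut edges / min(volume inside, volume outside).

    Assumes the standard definition; lower values mean a more
    community-like set.
    """
    s = set(nodes)
    # Edges with exactly one endpoint in s (each counted once, from the s side).
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_s
    denom = min(vol_s, vol_rest)
    return cut / denom if denom > 0 else float("inf")


def community_profile(adj, max_size=None):
    """Brute-force profile: for each size k, the minimum conductance over all k-sets.

    Exponential in the number of nodes, so usable only on toy graphs.
    """
    nodes = sorted(adj)
    if max_size is None:
        max_size = len(nodes) // 2
    return {
        k: min(conductance(adj, subset) for subset in combinations(nodes, k))
        for k in range(1, max_size + 1)
    }


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = build_adjacency(edges)
profile = community_profile(adj)

for size, phi in profile.items():
    print(f"size {size}: best conductance {phi:.3f}")
```

On this toy graph the profile drops at size 3, where one whole triangle is cut off by the single bridge edge. A downward slope of this kind is what small, well-separated communities produce; the abstract reports that in large networks the profile instead rises gradually once the size scale exceeds roughly 100 nodes.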
