Tools for large graph mining

Graphs show up in a surprisingly diverse set of disciplines, ranging from computer networks to sociology, biology, ecology and many more. What do “normal” graphs look like? How can we spot abnormal subgraphs within them? Which nodes or edges are “suspicious”? How does a virus spread over a graph? Answering these questions is vital for outlier detection (such as finding terrorist cells or money-laundering rings), forecasting, simulations (how well will a new protocol work on a realistic computer network?), immunization campaigns and many other applications. We attempt to answer these questions in two parts. First, we answer questions targeted at applications: what patterns or properties of a graph are important for solving specific problems? Here, we investigate the propagation behavior of a computer virus over a network, and find a simple formula for the epidemic threshold (beyond which any viral outbreak might become an epidemic). We also find an “information survival threshold” that determines whether a piece of information will survive in a sensor or P2P network with failing nodes and links. Finally, we develop a scalable, parameter-free method for finding groups of “similar” nodes in a graph, corresponding to homogeneous regions (or CrossAssociations) in the binary adjacency matrix of the graph. This can help navigate the structure of the graph and find non-obvious patterns. In the second part of our work, we investigate recurring patterns in real-world graphs to gain a deeper understanding of their structure. This leads to the development of the R-MAT model of graph generation for creating synthetic but “realistic” graphs, which match many of the patterns found in real-world graphs, including power-law and lognormal degree distributions, small diameter and “community” effects.
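
As a rough illustration of the recursive idea behind R-MAT-style generation, the sketch below places each edge by repeatedly choosing one of the four quadrants of the adjacency matrix with fixed probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rmat_edges`, the default quadrant probabilities, and the decision to keep duplicate edges are all choices made here for illustration.

```python
import random


def rmat_edges(scale, num_edges, probs=(0.45, 0.15, 0.15, 0.25), seed=0):
    """Return a list of directed edges (row, col) on 2**scale nodes.

    Each edge is placed by descending `scale` levels into the adjacency
    matrix, picking the top-left, top-right, bottom-left or bottom-right
    quadrant with probabilities a, b, c, d at every level.
    The default probabilities are illustrative assumptions.
    """
    a, b, c, d = probs
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("quadrant probabilities must sum to 1")
    rng = random.Random(seed)  # seeded so that runs are repeatable
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass          # top-left quadrant: both new bits stay 0
            elif r < a + b:
                col |= 1      # top-right quadrant
            elif r < a + b + c:
                row |= 1      # bottom-left quadrant
            else:
                row |= 1      # bottom-right quadrant
                col |= 1
        edges.append((row, col))  # duplicates are kept in this sketch
    return edges


if __name__ == "__main__":
    sample = rmat_edges(scale=4, num_edges=10)
    print(sample)
```

Unequal quadrant probabilities concentrate edges on a few rows and columns, which is the intuition for the skewed degree distributions mentioned above. A fuller generator would also need to decide how to treat duplicate edges and self-loops.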
