TellTail: Fast Scoring and Detection of Dense Subgraphs

Suppose you visit an e-commerce site and see that 50 users each reviewed almost all of the same 500 products, several times each: would you be suspicious? Similarly, given a Twitter follow graph, how can we design principled measures for identifying surprisingly dense subgraphs? Dense subgraphs often indicate interesting structure, such as attacks in network traffic graphs. However, most existing dense subgraph measures either do not model normal variation, or model it using an Erdős-Rényi assumption, which was discredited decades ago. What, then, is the right assumption? We propose a novel application of extreme value theory to the dense subgraph problem, yielding measures and algorithms that evaluate the surprisingness of a subgraph probabilistically, without requiring restrictive assumptions (e.g., Erdős-Rényi). We then improve the practicality of our approach by incorporating empirical observations about dense subgraph patterns in real graphs, and by proposing a fast pruning-based search algorithm. Our approach (a) provides theoretical guarantees of consistency, (b) scales quasi-linearly, and (c) outperforms baselines in synthetic and ground-truth settings.
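To make the extreme-value idea concrete, here is a minimal sketch, not the paper's algorithm. It assumes we already have density scores for a sample of "ordinary" subgraphs of comparable size. It fits a generalized Pareto distribution to the excesses over a high threshold and uses the fit to estimate how unlikely a candidate score is. The function names, the `tail_fraction` parameter, and the method-of-moments fit (used here in place of maximum likelihood to stay within the standard library) are all illustrative assumptions.

```python
import math
import random
import statistics


def fit_gpd_moments(excesses):
    """Method-of-moments estimates (shape xi, scale sigma) of a generalized
    Pareto distribution fitted to non-negative threshold excesses."""
    m = statistics.fmean(excesses)
    v = statistics.variance(excesses)
    if v == 0:
        raise ValueError("excesses have zero variance; cannot fit a tail")
    ratio = m * m / v
    xi = 0.5 * (1.0 - ratio)          # shape
    sigma = 0.5 * m * (1.0 + ratio)   # scale
    return xi, sigma


def tail_probability(scores, x, tail_fraction=0.1):
    """Estimate P(score > x) using a GPD fitted to the top `tail_fraction`
    of the sampled scores (hypothetical helper for illustration)."""
    ordered = sorted(scores)
    n = len(ordered)
    k = max(2, int(n * tail_fraction))
    if k >= n:
        raise ValueError("need more samples than tail points")
    u = ordered[n - k - 1]            # threshold: the (k+1)-th largest score
    if x <= u:
        # Below the threshold the empirical distribution is good enough.
        return sum(s > x for s in scores) / n
    xi, sigma = fit_gpd_moments([s - u for s in ordered[n - k:]])
    z = (x - u) / sigma
    if abs(xi) < 1e-9:
        survival = math.exp(-z)       # exponential tail (xi -> 0 limit)
    else:
        base = 1.0 + xi * z
        # base <= 0 means x lies beyond the fitted upper endpoint (xi < 0).
        survival = base ** (-1.0 / xi) if base > 0 else 0.0
    return (k / n) * survival


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for density scores of randomly sampled subgraphs.
    sampled = [rng.expovariate(1.0) for _ in range(5000)]
    print(tail_probability(sampled, 3.0), tail_probability(sampled, 8.0))
```

A small estimated tail probability for a candidate subgraph's score suggests that the subgraph is surprisingly dense relative to the sampled background, without assuming any particular random graph model for that background.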
