EagleMine: Vision-Guided Mining in Large Graphs

Given a graph with millions of nodes, what patterns exist in the distributions of node characteristics, and how can we detect them and separate anomalous nodes in a way similar to human vision? In this paper, we propose a vision-guided algorithm, EagleMine, to summarize microcluster patterns in two-dimensional histogram plots constructed from node features in a large graph. EagleMine uses a water-level tree to capture cluster structures at multiple resolutions, following vision-based intuition. EagleMine traverses the water-level tree from the root, adopts statistical hypothesis tests to determine the optimal clusters that should be fitted along the path, and summarizes each cluster with a truncated Gaussian distribution. Experiments on real data show that our method can find truncated and overlapping elliptical clusters, even when some baseline methods split one visual cluster into pieces with Gaussian spheres. To identify potentially anomalous microclusters, EagleMine also designates a score to measure the suspiciousness of outlier groups (i.e., node clusters) and outlier nodes, detecting bots and anomalous users with high accuracy in real microblog data.
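
The water-level intuition can be illustrated with a minimal sketch. It is not the EagleMine implementation: it assumes log-scaled equal-width bins, 4-connectivity between histogram cells, and a plain count threshold as the "water level". The names `histogram2d`, `islands`, and `water_levels` are hypothetical, and the tree construction, hypothesis tests, and truncated Gaussian fitting are omitted.

```python
from collections import deque
import math


def histogram2d(points, bins):
    """Count (x, y) node features on a bins x bins grid over log10(value + 1)."""
    logs = [(math.log10(x + 1), math.log10(y + 1)) for x, y in points]
    # Fall back to 1.0 so an all-zero feature does not divide by zero.
    max_x = max(lx for lx, _ in logs) or 1.0
    max_y = max(ly for _, ly in logs) or 1.0
    grid = [[0] * bins for _ in range(bins)]
    for lx, ly in logs:
        i = min(int(lx / max_x * bins), bins - 1)
        j = min(int(ly / max_y * bins), bins - 1)
        grid[i][j] += 1
    return grid


def islands(grid, level):
    """Return 4-connected groups of cells whose count exceeds `level`."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    found = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] <= level or (r, c) in seen:
                continue
            # Breadth-first flood fill over cells that stay above the water.
            queue, island = deque([(r, c)]), []
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                island.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen and grid[nr][nc] > level):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            found.append(island)
    return found


def water_levels(grid, levels):
    """Map each water level to the islands that remain above it."""
    return {level: islands(grid, level) for level in levels}
```

Raising the level on a small grid such as `[[5, 1, 4], [0, 0, 0]]` shows one island at level 0 splitting into two at level 1. Parent-child relations of this kind between islands at consecutive levels are what a tree over water levels would record.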
