gIceberg: Towards iceberg analysis in large graphs

Traditional multi-dimensional data analysis techniques such as iceberg cube cannot be directly applied to graphs for finding interesting or anomalous vertices due to the lack of dimensionality in graphs. In this paper, we introduce the concept of graph icebergs that refer to vertices for which the concentration (aggregation) of an attribute in their vicinities is abnormally high. Intuitively, these vertices shall be “close” to the attribute of interest in the graph space. Based on this intuition, we propose a novel framework, called gIceberg, which performs aggregation using random walks, rather than traditional SUM and AVG aggregate functions. This proposed framework scores vertices by their different levels of interestingness and finds important vertices that meet a user-specified threshold. To improve scalability, two aggregation strategies, forward and backward aggregation, are proposed with corresponding optimization techniques and bounds. Experiments on both real-world and synthetic large graphs demonstrate that gIceberg is effective and scalable.

[1]  Ambuj K. Singh,et al.  Mining Heavy Subgraphs in Time-Evolving Networks , 2011, 2011 IEEE 11th International Conference on Data Mining.

[2]  Philip S. Yu,et al.  Graph OLAP: a multi-dimensional framework for graph data analysis , 2009, Knowledge and Information Systems.

[3]  Christos Faloutsos,et al.  R-MAT: A Recursive Model for Graph Mining , 2004, SDM.

[4]  Kurt Mehlhorn,et al.  LEDA: a platform for combinatorial and geometric computing , 1997, CACM.

[5]  Jiawei Han,et al.  Top-K aggregation queries over large networks , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[6]  Jiawei Han,et al.  Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration , 2003, Very Large Data Bases Conference.

[7]  RamakrishnanRaghu,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999 .

[8]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[9]  Jimeng Sun,et al.  Neighborhood formation and anomaly detection in bipartite graphs , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[10]  W. Hoeffding Probability Inequalities for sums of Bounded Random Variables , 1963 .

[11]  Shang-Hua Teng,et al.  A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph Partitioning , 2008, SIAM J. Comput..

[12]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[13]  Rajeev Motwani,et al.  Computing Iceberg Queries Efficiently , 1998, VLDB.

[14]  Massimo Franceschet,et al.  PageRank , 2010, Commun. ACM.

[15]  Jennifer Neville,et al.  Randomization tests for distinguishing social influence and homophily effects , 2010, WWW '10.

[16]  Purnamrita Sarkar,et al.  Fast nearest-neighbor search in disk-resident graphs , 2010, KDD.

[17]  Fan Chung Graham,et al.  Local Partitioning for Directed Graphs Using PageRank , 2007, Internet Math..

[18]  Samir Khuller,et al.  On Finding Dense Subgraphs , 2009, ICALP.

[19]  Jennifer Widom,et al.  Scaling personalized web search , 2003, WWW '03.

[20]  Jiawei Han,et al.  Mining scale-free networks using geodesic clustering , 2004, KDD.

[21]  Martin Ester,et al.  Mining Cohesive Patterns from Graphs with Feature Vectors , 2009, SDM.

[22]  Kun-Lung Wu,et al.  Towards proximity pattern mining in large graphs , 2010, SIGMOD Conference.

[23]  Dániel Fogaras,et al.  Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments , 2005, Internet Math..

[24]  Aditya Bhaskara,et al.  Detecting high log-densities: an O(n¼) approximation for densest k-subgraph , 2010, STOC '10.

[25]  Ashish Goel,et al.  Fast Incremental and Personalized PageRank , 2010, Proc. VLDB Endow..

[26]  Michalis Faloutsos,et al.  BGP-lens: patterns and anomalies in internet routing updates , 2009, KDD.

[27]  Christos Faloutsos,et al.  oddball: Spotting Anomalies in Weighted Graphs , 2010, PAKDD.

[28]  Jignesh M. Patel,et al.  Efficient aggregation for graph summarization , 2008, SIGMOD Conference.

[29]  Pavel Berkhin,et al.  Bookmark-Coloring Algorithm for Personalized PageRank Computing , 2006, Internet Math..

[30]  Moses Charikar,et al.  Greedy approximation algorithms for finding dense components in a graph , 2000, APPROX.

[31]  Raghu Ramakrishnan,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999, SIGMOD '99.

[32]  Diane J. Cook,et al.  Graph-based anomaly detection , 2003, KDD '03.