Mining blackhole and volcano patterns in directed graphs: a general approach

Given a directed graph, the blackhole mining problem is to identify groups of nodes, called blackhole patterns, such that the average in-weight of the group is significantly larger than its average out-weight. Finding volcano patterns is the dual of mining blackhole patterns, so we focus on discovering blackhole patterns. In this article, we develop a generalized blackhole mining framework. Specifically, we design two pruning schemes that lower the computational cost by reducing both the number of candidate patterns and the average cost of evaluating each candidate. The first scheme exploits the concept of combination dominance to shrink the exponentially growing search space; based on this pruning approach, we develop the gBlackhole algorithm. The second scheme, named approxBlackhole, is an approximate approach that strikes a balance between the efficiency and the completeness of blackhole mining. Finally, experimental results on real-world data show that approxBlackhole can be several orders of magnitude faster than gBlackhole, and that both have large computational advantages over the brute-force approach. We also show that the blackhole mining algorithm can be used to capture some suspicious financial fraud patterns.
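To make the definition concrete, the following is a minimal sketch of the brute-force baseline mentioned above, not the gBlackhole or approxBlackhole algorithms. Several details are assumptions, since the abstract does not fix them: the graph is a dictionary mapping (source, target) pairs to weights, the "average" weights are the total crossing weights divided by the group size, and "significantly larger" is modelled as the difference exceeding a hypothetical threshold `theta`. All function names are illustrative, and no connectivity requirement is enforced on the group.

```python
from itertools import combinations


def group_weights(edges, group):
    """Return (in_weight, out_weight) of a node group.

    Only edges crossing the group boundary count; edges with both
    endpoints inside the group are ignored.
    """
    group = set(group)
    in_w = sum(w for (u, v), w in edges.items()
               if u not in group and v in group)
    out_w = sum(w for (u, v), w in edges.items()
                if u in group and v not in group)
    return in_w, out_w


def is_blackhole(edges, group, theta):
    """Assumed test: average in-weight exceeds average out-weight by more than theta."""
    in_w, out_w = group_weights(edges, group)
    n = len(set(group))
    return (in_w / n) - (out_w / n) > theta


def brute_force_blackholes(edges, max_size, theta):
    """Enumerate every node subset up to max_size and keep those passing the test.

    The number of candidates grows exponentially with max_size, which is
    the cost that pruning schemes aim to reduce.
    """
    nodes = sorted({node for edge in edges for node in edge})
    found = []
    for size in range(1, max_size + 1):
        for group in combinations(nodes, size):
            if is_blackhole(edges, group, theta):
                found.append(group)
    return found


# Toy weighted directed graph (hypothetical data, e.g. money transfers).
edges = {
    ("a", "c"): 5.0,
    ("b", "c"): 4.0,
    ("c", "d"): 1.0,
    ("d", "c"): 2.0,
    ("d", "a"): 0.5,
}

print(brute_force_blackholes(edges, max_size=2, theta=3.0))
```

Under these assumptions, a volcano pattern could be found with the same code by reversing every edge, which mirrors the duality noted in the abstract.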
