Efficiently extracting frequent subgraphs using MapReduce

Frequent subgraph extraction from a large number of small graphs is a primitive operation for many data mining applications. To extract frequent subgraphs, existing techniques need to enumerate a large number of subgraphs which is superlinear with the cardinality of the dataset. Given the rapid growing volume of graph data, it is difficult to perform the frequent subgraph extraction on a centralized machine efficiently. In this paper, we investigate how to efficiently perform this extraction over very large datasets using MapReduce. Parallelizing existing techniques directly using MapReduce does not yield good performance as it is difficult to balance the workload among the compute nodes. We therefore propose a framework that adopts the breadth first search strategy to iteratively extract frequent subgraphs, i.e., all frequent size-(i+1) subgraphs are generated based on frequent size-i subgraphs at the ith iteration using a single MapReduce job. To efficiently extract frequent subgraphs, we propose an isomorphism-testing-free approach by properly maintaining how frequent subgraphs are mapped within each graph. Extensive experiments conducted on our in-house clusters demonstrate the superiority of our proposed solution in comparison with the baseline approach.

[1]  Kenneth H. Rosen,et al.  Discrete Mathematics and its applications , 2000 .

[2]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[3]  Christian Borgelt,et al.  Mining molecular fragments: finding relevant substructures of molecules , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[4]  Joseph M. Hellerstein,et al.  Distributed GraphLab: A Framework for Machine Learning in the Cloud , 2012, Proc. VLDB Endow..

[5]  Jiong Yang,et al.  SPIN: mining maximal frequent subgraphs from graph databases , 2004, KDD.

[6]  Mirek Riedewald,et al.  Processing theta-joins using MapReduce , 2011, SIGMOD '11.

[7]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[9]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[10]  Wilfred Ng,et al.  Efficient query processing on graph databases , 2009, TODS.

[11]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[12]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[13]  Beng Chin Ooi,et al.  Query optimization for massively parallel data processing , 2011, SoCC.

[14]  Philip S. Yu,et al.  Graph indexing based on discriminative frequent structure analysis , 2005, TODS.

[15]  Beng Chin Ooi,et al.  Efficient Processing of k Nearest Neighbor Joins using MapReduce , 2012, Proc. VLDB Endow..

[16]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[17]  Takashi Washio,et al.  An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data , 2000, PKDD.

[18]  Wei Wang,et al.  Efficient mining of frequent subgraphs in the presence of isomorphism , 2003, Third IEEE International Conference on Data Mining.

[19]  Philip S. Yu,et al.  Graph indexing: a frequent structure-based approach , 2004, SIGMOD '04.

[20]  Chen Wang,et al.  Scalable mining of large disk-based graph databases , 2004, KDD.

[21]  Wilfred Ng,et al.  Fg-index: towards verification-free query processing on graph databases , 2007, SIGMOD '07.

[22]  Joost N. Kok,et al.  A quickstart in frequent structure mining can make a difference , 2004, KDD.