Paper information - ANG: a combination of Apriori and graph computing techniques for frequent itemsets mining

The Apriori algorithm is one of the most well-known and widely accepted methods for association rule mining. Apriori uses a prefix tree to represent k-itemsets, generates k-itemset candidates from the frequent ($$k-1$$)-itemsets, and determines the frequent k-itemsets by iteratively traversing the prefix tree against the transaction records. When k is small, Apriori executes very efficiently. When k becomes large, however, it can be very slow because of the deeper recursion needed to determine the frequent k-itemsets. From the perspective of graph computing, the transaction records can be converted to a graph $$G(V, E)$$, where V is the set of vertices of G, representing the transaction records, and E is the set of edges of G, representing the relations among transaction records. Each k-itemset in the transaction records has a corresponding connected component in G, and the number of vertices in that component is the support of the k-itemset. Since the time to find the connected component of a k-itemset in G is constant for any k, the graph computing method is very efficient when the number of k-itemsets is relatively small. Based on Apriori and graph computing techniques, a hybrid method called Apriori and Graph Computing (ANG) is proposed to compute the frequent itemsets. ANG initially uses Apriori to compute the frequent k-itemsets and then switches to the graph computing method when k becomes large (where the number of k-itemset candidates is relatively small). The experimental results show that ANG outperforms both Apriori and the graph computing method for all test cases.
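The hybrid control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the edges of G are built or when ANG switches methods, so the sketch makes two labeled assumptions: the switch point is a fixed, hypothetical parameter `switch_k`, and the graph step is replaced by a simple stand-in that takes the transactions containing an itemset as the vertex set of its component and counts them by intersecting per-item transaction-ID sets. All function names are illustrative.

```python
from itertools import combinations


def apriori_candidates(frequent_prev, k):
    """Build k-itemset candidates from the frequent (k-1)-itemsets and
    prune any candidate that has an infrequent (k-1)-subset."""
    prev = set(frequent_prev)
    items = sorted({i for s in prev for i in s})
    candidates = set()
    for s in prev:
        for i in items:
            if i not in s:
                c = frozenset(s | {i})
                if all(frozenset(sub) in prev for sub in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def count_by_scan(candidates, transactions):
    """Apriori-style counting: scan every transaction for every candidate."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def count_by_vertex_sets(candidates, tidsets):
    """Assumed stand-in for the graph step: the support of an itemset is the
    number of transactions (vertices) that contain all of its items."""
    return {c: len(set.intersection(*(tidsets[i] for i in c))) for c in candidates}


def ang_sketch(transactions, min_support, switch_k=3):
    """Hybrid loop: scan-based counting for k < switch_k, vertex-set counting
    afterwards. `switch_k` is a hypothetical fixed threshold."""
    transactions = [frozenset(t) for t in transactions]
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    # Frequent 1-itemsets come directly from the per-item transaction sets.
    frequent = {frozenset([i]): len(tids)
                for i, tids in tidsets.items() if len(tids) >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_candidates(frequent, k)
        if k < switch_k:
            counts = count_by_scan(candidates, transactions)
        else:
            counts = count_by_vertex_sets(candidates, tidsets)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    tx = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(ang_sketch(tx, min_support=2).items(),
                                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), support)
```

Both counting routines return the same supports. The sketch only shows where a switch between them would sit in the level-wise loop, not the performance behavior reported in the paper.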

Wenguang Chen | Rui Zhang | Yeh-Ching Chung | Hongji Yang | Tse-Chuan Hsu
