A Combination of Prefixed-Itemset and Database Optimization to Improve Apriori Algorithm on Hadoop Cluster

The Apriori algorithm is a classical data mining algorithm that is widely used for mining association rules, but it has well-known efficiency drawbacks. In this paper, we address the three main steps in the execution of the Apriori algorithm (connection, pruning, and counting) and propose a novel method that combines the prefixed-itemset storage structure with database optimization to improve the Apriori algorithm on a Hadoop cluster. First, we use the prefixed-itemset storage structure to improve how the connection step and the pruning step of the traditional Apriori algorithm are implemented, which increases the execution efficiency of the algorithm. Second, we change the storage schema of the database by converting the original transaction database into a transaction-state matrix, which transforms the storage pattern of the transaction data and improves the efficiency of the counting step. Third, we use the properties of frequent itemsets to improve the iterative termination condition of the algorithm, thereby reducing its running time. Finally, based on the Hadoop distributed architecture, we parallelize the Apriori algorithm optimized by the above steps using MapReduce. The experimental results show that, compared with the traditional Apriori algorithm, the improved Apriori algorithm on the Hadoop cluster achieves substantially higher execution efficiency.
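
The three optimizations summarized above can be illustrated with a minimal single-machine sketch in Python. It is an illustration under stated assumptions rather than the implementation evaluated in this paper: the function names (`build_state_matrix`, `generate_candidates`, `apriori`) are hypothetical, the transaction-state matrix is assumed to be stored as one bitmask per item, the termination test assumes the common property that a frequent (k+1)-itemset requires at least k+1 frequent k-itemsets, and the MapReduce parallelization is omitted entirely.

```python
from collections import defaultdict
from itertools import combinations


def build_state_matrix(transactions):
    """Assumed layout: one bitmask per item; bit t is set when transaction t contains the item."""
    columns = defaultdict(int)
    for t, items in enumerate(transactions):
        for item in items:
            columns[item] |= 1 << t
    return dict(columns)


def support(itemset, columns):
    """Counting step: AND the item columns and count the set bits (no database rescan)."""
    masks = [columns.get(item, 0) for item in itemset]
    mask = masks[0]
    for m in masks[1:]:
        mask &= m
    return bin(mask).count("1")


def generate_candidates(frequent_k):
    """Connection and pruning steps over itemsets grouped by their shared (k-1)-prefix."""
    groups = defaultdict(list)
    for itemset in frequent_k:  # each itemset is a sorted tuple
        groups[itemset[:-1]].append(itemset[-1])
    candidates = []
    for prefix, tails in groups.items():
        # Only itemsets with the same prefix can be joined, so pair up their last items.
        for a, b in combinations(sorted(tails), 2):
            cand = prefix + (a, b)
            # Prune: every k-subset of a (k+1)-candidate must itself be frequent.
            if all(sub in frequent_k for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates


def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset (sorted tuple) to its support count."""
    columns = build_state_matrix(transactions)
    level = {(item,) for item in columns if support((item,), columns) >= min_count}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset, columns)
        # Assumed early termination: fewer than k+1 frequent k-itemsets
        # cannot produce any frequent (k+1)-itemset.
        if len(level) < k + 1:
            break
        level = {c for c in generate_candidates(level) if support(c, columns) >= min_count}
        k += 1
    return result
```

For example, calling `apriori([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], 2)` would group the frequent 1-itemsets under the empty prefix, join them into 2-item candidates, and count each candidate with bitwise ANDs over the item columns. In a Hadoop setting, the counting would presumably be distributed across mappers over partitions of the matrix, which this sketch does not attempt to show.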