YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark
Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. The Apriori algorithm is a widely used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and compute-intensive. On the one hand, large-scale datasets are now common in data mining; on the other hand, to generate valid information the algorithm must scan the dataset iteratively many times. Together, these make FIM very time-consuming over big data. Parallel and distributed computing is an effective and widely used strategy for speeding up algorithms on large-scale datasets. However, existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In this paper, we propose YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework, an in-memory parallel computing model designed to support iterative algorithms and interactive data mining. Experimental results show that, compared with algorithms implemented with MapReduce, YAFIM achieves an 18× speedup on average across various benchmarks. In particular, we apply YAFIM in a real-world medical application to explore relationships in medicine, where it outperforms the MapReduce method by a factor of about 25.
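The level-wise structure that makes Apriori scan the dataset repeatedly can be illustrated with a minimal single-machine sketch. This is not YAFIM itself and uses no Spark; the function name `apriori` and the absolute-count `min_support` parameter are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items with one scan of the dataset.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # those k-itemsets whose every (k-1)-subset is itself frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Pass k: one more full scan of the dataset to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

Each iteration of the `while` loop is one full pass over `transactions`, which is the repeated scan the abstract identifies as the cost that an in-memory model is meant to reduce.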
FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree rather than conventional FP trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees and then mining these trees separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up mining for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
Traditional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. We start this study by identifying a serious performance problem in existing parallel Frequent Itemset Mining algorithms. Given a large dataset, the data partitioning strategies in existing solutions suffer from high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating a similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent, with an average of 18 percent.
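The idea of hashing similar transactions into the same partition can be sketched with MinHash, one common Locality-Sensitive Hashing family for set similarity. The abstract does not say which hash family or similarity metric FiDoop-DP uses, so MinHash, the helper names, and the `num_hashes` parameter below are all assumptions for illustration, and the Voronoi-based step is not modelled.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(transaction, num_hashes=4):
    """MinHash signature of a non-empty transaction (hypothetical helper).

    For each seeded hash function, keep the smallest hash value over the
    items. Transactions with a high Jaccard similarity are likely to agree
    on each component of the signature.
    """
    return tuple(
        min(_hash(seed, item) for item in transaction)
        for seed in range(num_hashes)
    )


def partition_by_lsh(transactions, num_hashes=4):
    """Group transactions whose full MinHash signatures are identical."""
    buckets = {}
    for t in transactions:
        buckets.setdefault(minhash_signature(t, num_hashes), []).append(t)
    return list(buckets.values())
```

Using the whole signature as the bucket key is the strictest choice; fewer hash functions or banded signatures would group less similar transactions together, trading locality against partition size.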
Accelerating frequent itemset mining on graphics processing units
In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared-memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalence Class Clustering (Eclat) method, in which a partial breadth-first search is used to exploit maximum parallelism while staying within the available memory capacity. In our approach, the vertical transaction lists are stored in a “bitset” representation and processed with wide bitwise operations across multiple threads on a GPU. We evaluate our approach using four NVIDIA Tesla GPUs and observe a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU.
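The bitset form of the vertical transaction lists can be sketched on a CPU with Python integers standing in for wide bit vectors. This is a plain depth-first Eclat, not the Frontier Expansion algorithm and not GPU code; the function name `eclat_bitset` and the absolute-count `min_support` parameter are assumptions for illustration.

```python
def eclat_bitset(transactions, min_support):
    """Return {tuple(itemset): support count} using bitset tid-lists.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    # Vertical layout: bit i of tidsets[item] is set when transaction i
    # contains the item.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets[item] = tidsets.get(item, 0) | (1 << tid)

    # Keep only frequent single items, in a fixed order.
    items = sorted(
        (item, bits)
        for item, bits in tidsets.items()
        if bin(bits).count("1") >= min_support
    )

    result = {}

    def extend(prefix, prefix_bits, candidates):
        for idx, (item, bits) in enumerate(candidates):
            # Intersecting tid-lists is a single bitwise AND.
            new_bits = bits if prefix_bits is None else prefix_bits & bits
            support = bin(new_bits).count("1")  # population count
            if support >= min_support:
                itemset = prefix + (item,)
                result[itemset] = support
                extend(itemset, new_bits, candidates[idx + 1:])

    extend((), None, items)
    return result
```

The support of an extended itemset is a bitwise AND followed by a population count, which is the kind of operation that maps naturally onto wide bitwise instructions.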
Parallel frequent itemset mining using systolic arrays
Because extracting frequent itemsets from a transaction database is crucial to several data mining tasks, such as association rule generation, frequent itemset mining is one of the most important concepts in data mining. One of the major problems in frequent itemset mining is the explosion in the number of results, which directly affects the execution time of itemset mining algorithms. To address this problem, closed itemsets have been proposed, which provide a concise, lossless representation of the original collection of frequent itemsets; the frequencies of all itemsets in the original collection can therefore be reconstructed from the reduced collection. However, the reduction provided by this exact method is not sufficient to solve the pattern explosion problem, mainly because of high-dimensional datasets, which have a large number of items in each transaction. Colossal itemset mining is another way to reduce the output size, but it is not useful when the set of all frequent itemsets is required. A higher level of performance improvement can be obtained from efficient, scalable parallel mining methods. In this paper we present an efficient, scalable parallel algorithm that uses systolic arrays to mine frequent itemsets in very large datasets, such as high-dimensional ones. In our algorithm, we use a bit matrix to compress the dataset and map the mining algorithm onto the systolic array architecture. For this purpose, each transaction of the dataset is represented as a row in the bit matrix. We use this bit matrix structure to model pattern mining as a systolic array problem. Our experimental results and performance study show that this algorithm substantially outperforms the best previously developed parallel algorithms.
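The row-per-transaction bit matrix can be sketched as follows, with each row stored as one integer and each item assigned a bit position. The systolic array mapping is not modelled; the helper names `build_bit_matrix` and `support` and the column ordering are assumptions for illustration.

```python
def build_bit_matrix(transactions):
    """Encode each transaction as one row of bits (hypothetical helper).

    Returns (col, rows): col maps an item to its bit position, and rows[i]
    is an integer whose set bits mark the items in transaction i.
    """
    items = sorted({item for t in transactions for item in t})
    col = {item: j for j, item in enumerate(items)}
    rows = []
    for t in transactions:
        row = 0
        for item in t:
            row |= 1 << col[item]
        rows.append(row)
    return col, rows


def support(itemset, col, rows):
    """Count the rows that contain every item of the itemset."""
    mask = 0
    for item in itemset:
        mask |= 1 << col[item]
    # A row contains the itemset when all of the mask's bits are set in it.
    return sum(1 for row in rows if (row & mask) == mask)
```

For example, with `col, rows = build_bit_matrix(data)`, calling `support({"a", "b"}, col, rows)` counts the transactions containing both items with one AND and one comparison per row.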