Mining Approximate Frequent Itemsets In the Presence of Noise: Algorithm and Analysis
Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional “exact” model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms. In this paper we propose a noise-tolerant itemset model, which we call approximate frequent itemsets (AFI). Like frequent itemsets, the AFI model requires that an itemset have a minimum number of supporting transactions. However, the AFI model tolerates a controlled fraction of errors in each item and each supporting transaction. Motivating this model are theoretical results (and a supporting simulation study presented here) which state that, in the presence of even low levels of noise, large frequent itemsets are broken into fragments of logarithmic size; thus the itemsets cannot be recovered by a routine application of frequent itemset mining. By contrast, we provide theoretical results showing that the AFI criterion is well suited to recovery of block structures subject to noise. We developed and implemented an algorithm to mine AFIs that generalizes the level-wise enumeration of frequent itemsets by allowing noise. We propose the noise-tolerant support threshold, a relaxed version of support, which varies with the length of the itemset and the noise threshold. We exhibit an Apriori property that permits the pruning of an itemset if any of its sub-itemsets is not sufficiently supported. Several experiments demonstrate that the AFI algorithm enables better recoverability of frequent patterns under noisy conditions than existing frequent itemset mining approaches.
Noise-tolerant support pruning also yields an order-of-magnitude performance gain over existing methods.
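The AFI criterion described above (a minimum number of supporting transactions, with a bounded fraction of errors in each item and each supporting transaction) can be illustrated with a small check on a binary transaction matrix. This is a minimal sketch, not the paper's mining algorithm: the function name `is_afi`, the parameter names `eps_row` and `eps_col`, and the convention that a 0 entry counts as an error are assumptions made for illustration.

```python
from typing import List, Sequence


def is_afi(matrix: Sequence[Sequence[int]],
           rows: List[int],
           cols: List[int],
           eps_row: float,
           eps_col: float,
           min_support: int) -> bool:
    """Check an assumed form of the AFI criterion on a 0/1 matrix.

    rows: indices of candidate supporting transactions
    cols: indices of the items in the candidate itemset
    eps_row: assumed max fraction of missing items per transaction
    eps_col: assumed max fraction of missing transactions per item
    """
    if len(rows) < min_support or not cols:
        return False
    # Each supporting transaction may miss at most eps_row of the items.
    for r in rows:
        zeros = sum(1 for c in cols if matrix[r][c] == 0)
        if zeros > eps_row * len(cols):
            return False
    # Each item may be absent from at most eps_col of the transactions.
    for c in cols:
        zeros = sum(1 for r in rows if matrix[r][c] == 0)
        if zeros > eps_col * len(rows):
            return False
    return True


# Illustrative data: a 4x4 block with two noisy entries, plus one
# unrelated transaction (row 4).
data = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
block_rows = [0, 1, 2, 3]
block_cols = [0, 1, 2, 3]

tolerant = is_afi(data, block_rows, block_cols, 0.25, 0.25, 4)
exact = is_afi(data, block_rows, block_cols, 0.0, 0.0, 4)
```

With both tolerances set to zero the check reduces to the exact model, so the noisy block is rejected; with a 25% tolerance in each direction the same block is accepted.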
Approximate Frequent Itemset Mining In the Presence of Random Noise
Frequent itemset mining has been a focus of data mining research and is an important first step in the analysis of data arising in a broad range of applications. The traditional exact model for frequent itemsets requires that every item occur in each supporting transaction. However, real application data is usually subject to random noise or measurement error, which poses new challenges for the efficient discovery of frequent itemsets from noisy data. Mining approximate frequent itemsets in the presence of noise involves two key issues: the definition of a noise-tolerant mining model and the design of an efficient mining algorithm. In this chapter, we give an overview of approximate itemset mining algorithms in the presence of random noise and examine several noise-tolerant mining approaches.
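To make the exact model concrete, the sketch below counts support under the rule that every item must occur in each supporting transaction, and shows how a single dropped item lowers the count. The function name `exact_support` and the toy transactions are assumptions for illustration only.

```python
from typing import Iterable, List, Set


def exact_support(transactions: List[Set[str]], itemset: Iterable[str]) -> int:
    """Count transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


# Four clean transactions that all contain {a, b, c}.
clean = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "b", "c"}]

# The same data after noise removes item "b" from one transaction.
noisy = [{"a", "b", "c"}, {"a", "c"}, {"a", "b", "c", "d"}, {"a", "b", "c"}]

clean_count = exact_support(clean, ["a", "b", "c"])
noisy_count = exact_support(noisy, ["a", "b", "c"])
```

Under a minimum support of 4, the itemset {a, b, c} is frequent in the clean data but not in the noisy data, even though only one entry changed.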