A pure array structure and parallel strategy for high-utility sequential pattern mining

Abstract: High-utility sequential pattern mining (HUSPM) is the task of discovering all sequential patterns in a sequence database whose utility is equal to or greater than a given minimum utility threshold. HUSPM has become increasingly important in real-world data mining applications such as market basket analysis, weblog mining, and biomedical gene data analysis, because it considers co-occurrence values, quantity, utility (e.g., profit or cost), and time. Current HUSPM approaches in the literature use a utility matrix to store the sequence database in memory. Unfortunately, the utility matrix consumes a large amount of main memory. To address this issue, we introduce a pure array structure that consumes less memory than the utility matrix. A second challenge in HUSPM is obtaining a downward closure property (DCP) with which to prune the search space. Recent HUSPM algorithms use an upper bound on utility values to serve as the DCP. However, this upper bound is usually higher than the actual utility of the patterns, so these algorithms may generate many candidate patterns. The resulting large search space leads to poor performance through excessive runtime and memory usage, in part because the number of projected-database scans required to calculate actual utilities is proportional to the number of candidate patterns. In this paper, we present a novel pruning strategy that efficiently prunes patterns that are not high-utility sequential patterns (HUSPs) and significantly reduces the search space compared to the state-of-the-art HUS-Span algorithm. We also propose a parallel strategy to speed up the mining process. Based on these ideas, we propose two algorithms: the pure Array structure for High-utility Sequential pattern mining (AHUS) and its parallel version (AHUS-P). AHUS-P uses shared memory to parallelize the mining process and identifies HUSPs concurrently, taking advantage of multi-core processor architectures. The experimental results show that AHUS and AHUS-P discover all HUSPs efficiently and effectively. Both proposed algorithms outperform the state-of-the-art HUS-Span algorithm in runtime, memory usage, and scalability on all experimental datasets.
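To make the task concrete, the following is a minimal brute-force sketch of the HUSPM problem statement, not of AHUS or HUS-Span. It assumes the definition commonly used in the HUSPM literature: the utility of a pattern in one sequence is the maximum, over all of its occurrences, of the summed quantity × unit-profit of the matched items, and its utility in the database is the sum over sequences. It also computes one well-known upper bound, the sequence-weighted utility (SWU), purely to illustrate how far such a bound can sit above the actual utility; the abstract does not say which bound the compared algorithms use. The toy database, the profit table, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical external utilities (e.g., unit profit) per item.
PROFIT = {"a": 2, "b": 5, "c": 1}

# Hypothetical quantitative sequence database: each sequence is a list of
# itemsets, and each itemset maps an item to its quantity.
DATABASE = [
    [{"a": 1, "b": 2}, {"c": 3}, {"a": 2}],
    [{"b": 1}, {"a": 4, "c": 1}],
    [{"a": 1}, {"a": 3, "b": 1}, {"c": 2}],
]


def utility_in_sequence(pattern, sequence, profit):
    """Maximum utility over all occurrences of pattern, or None if absent."""
    best = None
    # Try every strictly increasing choice of itemset positions.
    for positions in combinations(range(len(sequence)), len(pattern)):
        pairs = list(zip(pattern, positions))
        if all(items.issubset(sequence[pos]) for items, pos in pairs):
            utility = sum(
                sequence[pos][item] * profit[item]
                for items, pos in pairs
                for item in items
            )
            best = utility if best is None else max(best, utility)
    return best


def pattern_utility(pattern, database, profit):
    """Actual utility: sum of per-sequence maxima over containing sequences."""
    per_sequence = (utility_in_sequence(pattern, s, profit) for s in database)
    return sum(u for u in per_sequence if u is not None)


def sequence_utility(sequence, profit):
    """Total utility of every item in a sequence."""
    return sum(q * profit[i] for itemset in sequence for i, q in itemset.items())


def swu(pattern, database, profit):
    """Sequence-weighted utility: an upper bound on pattern_utility."""
    return sum(
        sequence_utility(s, profit)
        for s in database
        if utility_in_sequence(pattern, s, profit) is not None
    )


def is_high_utility(pattern, database, profit, min_util):
    """A pattern is high-utility if its actual utility reaches min_util."""
    return pattern_utility(pattern, database, profit) >= min_util


# The pattern <(a)(c)>: item a, followed later by item c.
A_THEN_C = [frozenset({"a"}), frozenset({"c"})]
```

On this toy data, `pattern_utility(A_THEN_C, DATABASE, PROFIT)` is 13 while `swu(A_THEN_C, DATABASE, PROFIT)` is 34. A miner that prunes on the bound alone would keep this pattern as a candidate for any threshold up to 34, even though it is high-utility only for thresholds up to 13.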
