Real-time stream data mining based on CanTree and Gtree

The proposed algorithm discovers the complete set of frequent itemsets from stream data. It uses CanTree to store transactions and maintains the sliding window efficiently. GTree is proposed to find frequent itemsets and serves as a projection-tree. GTree uses a top-down tree traversal and prunes infrequent items. Combining CanTree and GTree reduces the data mining cost significantly.

There is an increasing need to discover knowledge from data streams in real time. Real-time stream data mining needs a compact data structure that stores the transactions of the most recent sliding window in one scan, and an efficient algorithm that discovers frequent itemsets from that structure. In this paper, we propose a novel data mining algorithm, called CanTree-GTree, which discovers the complete set of frequent itemsets from real-time transactions based on sliding windows. The algorithm uses two data structures: CanTree and GTree. CanTree compactly represents all transactions in a sliding window in one scan and serves as the base-tree. The algorithm maintains the base-tree efficiently by adding new transactions and removing old ones without any reconstruction phase. A novel data structure, called GTree (Group Tree), serves as a projection-tree for each data item. To build the projection-tree, the algorithm traverses each node of the base-tree only once using a top-down tree traversal, and it discovers frequent itemsets at low processing cost. The proposed algorithm is therefore effective for discovering frequent itemsets in real-time stream data. In our performance evaluation against algorithms based on CPSTree and CanTree-FPTree, our algorithm outperforms them on the synthetic data set by about 35% and 26% in run-time cost, respectively. We also confirm that the proposed algorithm shows excellent results on real-world data sets.
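
To make the base-tree idea concrete, the following is a minimal Python sketch of a CanTree-style prefix tree maintained over a sliding window. It is an illustration under stated assumptions, not the paper's implementation: the class and method names (`SlidingCanTree`, `add`, `support`) are hypothetical, the canonical order is assumed to be lexicographic, the window is assumed to be counted in transactions, and the `support` query is a plain tree walk rather than the GTree projection described above.

```python
from collections import deque


class Node:
    """One prefix-tree node: an item, its count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class SlidingCanTree:
    """Hypothetical CanTree-style base-tree over the last `window_size` transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.root = Node()
        self.window = deque()  # transactions in the window, oldest first

    def add(self, transaction):
        # Assumption: canonical order is lexicographic, so the tree shape
        # does not depend on item frequencies and never needs rebuilding.
        items = tuple(sorted(set(transaction)))
        if len(self.window) == self.window_size:
            self._remove(self.window.popleft())  # slide: drop the oldest
        self.window.append(items)
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item)
                node.children[item] = child
            child.count += 1
            node = child

    def _remove(self, items):
        # Decrement counts along the transaction's path; a node whose count
        # reaches zero was used only by this transaction, so its whole
        # subtree can be detached at once.
        node = self.root
        for item in items:
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                del node.children[item]
                return
            node = child

    def support(self, itemset):
        """Count window transactions that contain every item of `itemset`."""
        target = sorted(set(itemset))
        if not target:
            return len(self.window)
        total = 0
        stack = [(self.root, 0)]
        while stack:
            node, i = stack.pop()
            for child in node.children.values():
                if child.item == target[i]:
                    if i + 1 == len(target):
                        total += child.count
                    else:
                        stack.append((child, i + 1))
                elif child.item < target[i]:
                    stack.append((child, i))
                # child.item > target[i]: items only grow along a path, so prune.
        return total


tree = SlidingCanTree(window_size=3)
for t in (["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"]):
    tree.add(t)
print(tree.support(["a", "b"]))  # the oldest transaction has already slid out
```

Because the item order is fixed in advance, inserting or removing a transaction touches only the nodes on its own path, which is the property that lets the base-tree be maintained without a reconstruction phase.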
