Paper Information - Stream Sequential Pattern Mining with Precise Error Bounds

Stream Sequential Pattern Mining with Precise Error Bounds

Sequential pattern mining is an interesting data mining problem with many real-world applications. This problem has been studied extensively in static databases. However, in recent years, emerging applications have introduced a new form of data called a data stream. In a data stream, new elements are generated continuously. This poses additional constraints on the methods used for mining such data: memory usage is restricted, the infinitely flowing original dataset cannot be scanned multiple times, and current results should be available on demand.

This paper introduces two effective methods for mining sequential patterns from data streams: the SS-BE method and the SS-MB method. The proposed methods break the stream into batches and process each batch only once. The two methods use different pruning strategies that restrict memory usage while still guaranteeing that all true sequential patterns are output at the end of any batch. Both algorithms scale linearly in execution time as the number of sequences grows, making them effective methods for sequential pattern mining in data streams. The experimental results also show that the methods are very accurate, in that only a small fraction of the output patterns are false positives. Even for these false positives, SS-BE guarantees that their true support is above a pre-defined threshold.
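The batch-and-prune idea in the abstract can be illustrated with a small sketch. This is not the SS-BE or SS-MB algorithm from the paper; it is an assumption-laden stand-in that applies lossy-counting-style pruning (in the spirit of reference [5]) to short sequential patterns. The class name `BatchedPatternCounter`, the helper `patterns_of`, and the parameters `epsilon` and `max_len` are all hypothetical names chosen for illustration.

```python
from itertools import combinations
from math import ceil


def patterns_of(sequence, max_len=2):
    """Distinct order-preserving subsequences (gaps allowed) of up to max_len items."""
    found = set()
    for k in range(1, max_len + 1):
        # combinations() keeps the input order, so each tuple is a subsequence.
        found.update(combinations(sequence, k))
    return found


class BatchedPatternCounter:
    """Illustrative lossy-counting-style counter; NOT the paper's SS-BE/SS-MB."""

    def __init__(self, epsilon, max_len=2):
        self.epsilon = epsilon                # allowed error as a fraction of the stream
        self.batch_size = ceil(1 / epsilon)   # sequences per batch
        self.max_len = max_len
        self.seen = 0                         # sequences processed so far
        self.batch_id = 1                     # index of the batch being filled
        self.table = {}                       # pattern -> [count, max_missed]

    def add(self, sequence):
        """Process one sequence exactly once; prune when a batch completes."""
        self.seen += 1
        for pattern in patterns_of(sequence, self.max_len):
            if pattern in self.table:
                self.table[pattern][0] += 1
            else:
                # The pattern may have been missed at most (batch_id - 1) times.
                self.table[pattern] = [1, self.batch_id - 1]
        if self.seen % self.batch_size == 0:
            self._prune()

    def _prune(self):
        """Drop patterns whose count plus possible missed count is too small."""
        self.table = {
            p: entry for p, entry in self.table.items()
            if entry[0] + entry[1] > self.batch_id
        }
        self.batch_id += 1

    def frequent(self, min_support):
        """Patterns whose stored count reaches (min_support - epsilon) * seen."""
        threshold = (min_support - self.epsilon) * self.seen
        return {p: e[0] for p, e in self.table.items() if e[0] >= threshold}


stream = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "b"),
          ("a", "b"), ("d",), ("a", "b"), ("a", "c")]
counter = BatchedPatternCounter(epsilon=0.25)
for seq in stream:
    counter.add(seq)
print(counter.frequent(min_support=0.5))
```

Under these assumptions, a stored count never exceeds the true support and undercounts it by at most `epsilon * seen`, so every pattern with true support of at least `min_support * seen` is reported, and any reported pattern has true support of at least `(min_support - epsilon) * seen`. Enumerating all subsequences with `combinations` is only practical for short sequences and small `max_len`; a real miner would use a pattern-growth method such as PrefixSpan [10] per batch.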

Jiawei Han | Bolin Ding | Luiz F. Mendes

[1] Philip S. Yu, et al. On demand classification of data streams, 2004, KDD.

[2] Sudipto Guha, et al. Clustering Data Streams, 2000, FOCS.

[3] Christie I. Ezeife, et al. SSM: A Frequent Sequential Data Stream Patterns Miner, 2007, 2007 IEEE Symposium on Computational Intelligence and Data Mining.

[4] Philip S. Yu, et al. A Framework for Clustering Evolving Data Streams, 2003, VLDB.

[5] Rajeev Motwani, et al. Approximate Frequency Counts over Data Streams, 2012, VLDB.

[6] Richard M. Karp, et al. A simple algorithm for finding frequent elements in streams and bags, 2003, TODS.

[7] Philip S. Yu, et al. Mining concept-drifting data streams using ensemble classifiers, 2003, KDD '03.

[8] Divyakant Agrawal, et al. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2005, ICDT.

[9] Hong Chen, et al. GraSeq: A Novel Approximate Mining Approach of Sequential Patterns over Data Stream, 2007, ADMA.

[10] Qiming Chen, et al. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, 2001, Proceedings 17th International Conference on Data Engineering.

[11] Florent Masseglia, et al. Mining Sequential Patterns from Temporal Streaming Data, 2005.