A Test Paradigm for Detecting Changes in Transactional Data Streams

A pattern is considered useful if it helps a person achieve a goal. Mining data streams for useful patterns is important in many applications. However, data streams can change their behavior over time, and a significant change that is not handled properly can seriously harm the mining result. Many past studies have focused mainly on adapting to changes in data streams. We contend that adapting to changes is not enough: the ability to detect and characterize change is also essential in many applications, such as intrusion detection, network traffic analysis, and monitoring of data streams from intensive care units. Detecting changes is nontrivial. In this paper, we propose an online algorithm for change detection in utility mining. To provide a mechanism for describing the detected change quantitatively, we adopt statistical tests. We believe there is an opportunity for an immensely rewarding synergy between data mining and statistics. We evaluate different statistical significance tests, and our study shows that the Chi-square test is the most suitable for enumerated or count data (as is the case for high utility itemsets). We demonstrate the effectiveness of the proposed method by testing it on IBM QUEST market-basket data.
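
To make the idea of a Chi-square test on count data concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that each window of the stream has already been summarized as a list of counts, one per itemset, and it compares a reference window with the current window using the Pearson Chi-square statistic for a 2 x k contingency table. The function names `chi_square_statistic` and `detect_change`, the example counts, and the fixed critical value are all illustrative assumptions.

```python
def chi_square_statistic(reference, current):
    """Pearson chi-square statistic for a 2 x k table of counts.

    `reference` and `current` are hypothetical per-itemset counts taken
    from two non-empty windows of the stream (same itemsets, same order).
    """
    if len(reference) != len(current):
        raise ValueError("both windows must cover the same itemsets")

    row_totals = (sum(reference), sum(current))
    grand_total = sum(row_totals)
    if 0 in row_totals:
        raise ValueError("each window must contain at least one count")

    statistic = 0.0
    for ref_count, cur_count in zip(reference, current):
        col_total = ref_count + cur_count
        if col_total == 0:
            continue  # itemset absent from both windows: contributes nothing
        for observed, row_total in ((ref_count, row_totals[0]),
                                    (cur_count, row_totals[1])):
            # Expected count if both windows shared one distribution.
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def detect_change(reference, current, critical_value):
    """Flag a change when the statistic exceeds the chosen critical value."""
    return chi_square_statistic(reference, current) > critical_value


# Illustrative counts for three itemsets in two windows (made-up numbers).
reference_window = [50, 30, 20]
current_window = [20, 30, 50]

# With k = 3 itemsets there are (2 - 1) * (3 - 1) = 2 degrees of freedom;
# 5.991 is the chi-square critical value for 2 degrees of freedom at the
# 0.05 significance level.
CRITICAL_VALUE_DF2_ALPHA_05 = 5.991

changed = detect_change(reference_window, current_window,
                        CRITICAL_VALUE_DF2_ALPHA_05)
```

Beyond the yes/no decision, the size of the statistic gives a quantitative description of how strongly the two windows differ. The critical value has to be chosen to match the number of itemsets compared and the desired significance level.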
