Paper Information - CUBS: Multivariate Sequence Classification Using Bounded Z-score with Sampling

CUBS: Multivariate Sequence Classification Using Bounded Z-score with Sampling

Multivariate temporal sequence classification is an important and challenging task. Several attempts to address this problem exist, but none provide a full solution. In this paper we present CUBS: Classification Using Bounded Z-Score with Sampling. CUBS uses item set mining to produce frequent subsequences, and then selects the statistically significant subsequences among them to compose a classification model. We introduce an improved item set mining algorithm that solves the short sequence bias present in many item set mining algorithms. Unfortunately, the z-score normalization hinders pruning. We provide a bound on the z-score to address this issue. Calculating the z-score normalization requires statistical values of the data, which are gathered from a small sample of the database. The sampling distorts these values. We analyze this distortion and correct it. We evaluate CUBS for accuracy and scalability on a synthetic dataset and on two real-world datasets. The results demonstrate how short subsequence bias is solved in the mining, and show how our bound and sampling technique enable speedup.
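
The abstract does not give the normalization formula. As a rough illustration of why a z-score can counter short sequence bias, the sketch below standardizes each subsequence's raw support against other subsequences of the same length. Grouping by length, the function name `zscore_by_length`, and the toy counts are all assumptions made for this example; this is not the CUBS algorithm, and it omits the bound and the sampling correction.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_length(support):
    """Standardize support within each subsequence-length group.

    `support` maps a subsequence (tuple of items) to its raw support count.
    Grouping by length is an assumption of this sketch, not taken from the paper.
    """
    by_length = defaultdict(list)
    for seq, count in support.items():
        by_length[len(seq)].append(count)

    # Mean and population standard deviation of support for each length.
    stats = {n: (mean(c), pstdev(c)) for n, c in by_length.items()}

    scores = {}
    for seq, count in support.items():
        mu, sigma = stats[len(seq)]
        # A length group with no spread carries no signal; score it as 0.
        scores[seq] = (count - mu) / sigma if sigma > 0 else 0.0
    return scores


# Hypothetical toy counts: short subsequences have much higher raw support.
support = {
    ("a",): 90, ("b",): 100, ("c",): 110,
    ("a", "b"): 10, ("b", "c"): 10, ("a", "c"): 40,
}
scores = zscore_by_length(support)
```

In this toy data, `("a", "c")` has far lower raw support than `("c",)`, yet it receives the higher score because it stands out more among subsequences of its own length. A raw support threshold would keep the short subsequences and miss it.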

Sarit Kraus | Ariella Richardson | Gal A. Kaminka
