CUBS: Multivariate Sequence Classification Using Bounded Z-score with Sampling

Multivariate temporal sequence classification is an important and challenging task. Several attempts to address this problem exist, but none provides a full solution. In this paper we present CUBS: Classification Using Bounded Z-Score with Sampling. CUBS uses item set mining to produce frequent subsequences, and then selects the statistically significant ones among them to compose a classification model. We introduce an improved item set mining algorithm that solves the short sequence bias present in many item set mining algorithms. Unfortunately, the z-score normalization it relies on hinders pruning, so we provide a bound on the z-score to address this issue. Calculating the z-score normalization requires some statistical values of the data, which are gathered from a small sample of the database. The sampling distorts these values; we analyze this distortion and correct it. We evaluate CUBS for accuracy and scalability on a synthetic dataset and on two real-world datasets. The results demonstrate how the short subsequence bias is solved in the mining, and show how our bound and sampling technique enable speedup.
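To make the short sequence bias concrete, the following minimal Python sketch normalizes raw support counts with a z-score computed per subsequence length. It is an illustration only, not the CUBS algorithm: the function name `zscore_by_length`, the toy support counts, and the choice to normalize against the mean and standard deviation of same-length subsequences are all assumptions made here for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_length(support):
    """Return the z-score of each subsequence's support, computed
    against all subsequences of the same length (hypothetical helper)."""
    # Group raw support counts by subsequence length.
    by_length = defaultdict(list)
    for seq, count in support.items():
        by_length[len(seq)].append(count)

    # Mean and population standard deviation for each length.
    stats = {n: (mean(c), pstdev(c)) for n, c in by_length.items()}

    scores = {}
    for seq, count in support.items():
        mu, sigma = stats[len(seq)]
        # A length whose supports are all equal carries no signal.
        scores[seq] = (count - mu) / sigma if sigma > 0 else 0.0
    return scores


# Toy support counts (made up): short subsequences are naturally more frequent.
support = {
    ("a",): 90, ("b",): 80, ("c",): 70,
    ("a", "b"): 30, ("b", "c"): 10, ("a", "c"): 20,
}
scores = zscore_by_length(support)
```

Ranked by raw support, every length-1 subsequence beats every length-2 subsequence. After normalization, `("a",)` and `("a", "b")` receive the same score, because each sits the same number of standard deviations above the mean for its own length.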
