A Study on the Accuracy of Frequency Measures and Its Impact on Knowledge Discovery in Single Sequences

In knowledge discovery in single sequences, different results can be discovered from the same sequence when different frequency measures are adopted. This naturally raises several questions: (1) Do these frequency measures reflect actual frequencies accurately? (2) What impact do frequency measures have on the discovered knowledge? (3) Are the discovered results accurate and reliable? (4) Which measures are appropriate for reflecting frequencies accurately? In this paper, taking three major factors into account (anti-monotonicity, maximum frequency, and window-width restriction), we identify inaccuracies inherent in seven existing frequency measures and investigate their impact on the soundness and completeness of two kinds of knowledge discovered from single sequences: frequent episodes and episode rules. To obtain more accurate frequencies and knowledge, we provide three recommendations for defining appropriate frequency measures and, following them, introduce a more appropriate frequency measure. Empirical evaluation reveals the inaccuracies and verifies our findings.
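
As a small illustration of the opening claim, the sketch below counts the same serial episode in the same sequence under two measures that are common in the episode-mining literature: a sliding-window frequency and a count of non-overlapped occurrences. Both definitions are simplified assumptions made for this example (one event per time step, windows clipped at the sequence boundaries) and are not necessarily the exact formulations of the seven measures studied in the paper. The function names are hypothetical.

```python
def contains_serial(window, episode):
    """True if `episode` occurs in `window` as an ordered subsequence."""
    it = iter(window)
    # `e in it` consumes the iterator up to the first match, so order is enforced.
    return all(e in it for e in episode)


def window_frequency(seq, episode, width):
    """Number of sliding windows of the given width that contain the episode.

    Assumption: one event per time step; windows may overhang either end of
    the sequence and are clipped to it.
    """
    n = len(seq)
    count = 0
    for start in range(-(width - 1), n):
        window = seq[max(start, 0):min(start + width, n)]
        if contains_serial(window, episode):
            count += 1
    return count


def non_overlapped_count(seq, episode):
    """Greedy left-to-right count of non-overlapped occurrences.

    Assumption: no window-width restriction is applied here.
    """
    count = 0
    pos = 0  # index of the next episode event we are waiting for
    for event in seq:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                count += 1
                pos = 0  # start looking for a fresh occurrence
    return count


seq = list("ABCAB")
episode = ("A", "B")
print(window_frequency(seq, episode, width=3))  # 4
print(non_overlapped_count(seq, episode))       # 2
```

For the sequence A B C A B, the episode "A followed by B" is counted four times by the window-based measure and twice by the non-overlapped measure, so the same minimum-frequency threshold could accept the episode under one measure and reject it under the other.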
