Parallel Algorithm for Mining Frequent Closed Sequences

Previous studies have argued convincingly that a frequent sequence mining algorithm should mine only the closed frequent sequences rather than all frequent sequences, because the closed set is more compact yet still complete and can be mined more efficiently. However, mining frequent closed sequences remains challenging on a stand-alone machine because sequence databases are large and high-dimensional. This paper presents PFCSeq, an algorithm for mining frequent closed sequences on a distributed-memory parallel machine. Each processor mines its local set of frequent closed sequences independently, using an approach that combines task parallelism with data parallelism, and only two communications are needed unless a load imbalance is detected. The time spent in communication is therefore significantly reduced. To ensure good load balance among processors, a dynamic workload balancing strategy is proposed. Experiments show that PFCSeq scales linearly with both the database size and the number of processors.
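
To make the notion of a closed sequence concrete, the following is a minimal brute-force sketch in Python. It is not PFCSeq and performs no parallel mining; it only illustrates the standard definition that a frequent sequence is closed when no proper super-sequence has the same support. The toy database, the simplification of sequences to ordered lists of single items, and all function names are assumptions made for illustration.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_sequences(database, min_support):
    """Brute force: test every subsequence of every input sequence."""
    candidates = set()
    for seq in database:
        for length in range(1, len(seq) + 1):
            for idx in combinations(range(len(seq)), length):
                candidates.add(tuple(seq[i] for i in idx))
    result = {}
    for pattern in candidates:
        count = support(pattern, database)
        if count >= min_support:
            result[pattern] = count
    return result


def closed_sequences(frequent):
    """Keep patterns with no longer super-sequence of equal support."""
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
        if not absorbed:
            closed[p] = s
    return closed


# Hypothetical toy database: each sequence is an ordered tuple of items.
database = [("a", "b", "c"), ("a", "b", "c"), ("a", "c"), ("b", "c")]
frequent = frequent_sequences(database, min_support=2)
closed = closed_sequences(frequent)
print(len(frequent), "frequent,", len(closed), "closed")
print(sorted(closed.items()))
```

On this toy database with a minimum support of 2 there are seven frequent sequences but only four closed ones. The support of each dropped sequence can be recovered from a closed super-sequence (for example, `("a",)` has the same support as `("a", "c")`), which is the sense in which the closed set is compact yet complete.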
