Paper Information - CISpan: Comprehensive Incremental Mining Algorithms of Closed Sequential Patterns for Multi-Versional Software Mining

Recently, frequent sequential pattern mining algorithms have been widely used in the software engineering field to mine various source code or specification patterns. In practice, software evolves from one version to another over its life span. The effort of mining frequent sequential patterns across multiple versions of a software system can be substantially reduced by efficient incremental mining. This problem is challenging in this domain because the databases are usually updated in many ways, including insertion, various modifications, and removal of sequences. In addition, different mining tools may impose different mining constraints, such as a low minimum support. None of the existing work can be applied effectively, owing to its various limitations. For example, our recent work, IncSpan, fails to solve the problem because it can handle neither low minimum support nor removal of sequences from the database. In this paper, we propose a novel, comprehensive incremental mining algorithm for frequent sequential patterns, CISpan (Comprehensive Incremental Sequential Pattern mining). CISpan supports both closed and complete incremental frequent sequence mining, with all kinds of updates to the database. Compared to IncSpan, CISpan tolerates a wide range of minimum support thresholds (as low as 2). Our performance study shows that, in addition to handling test cases on which IncSpan fails, CISpan outperforms IncSpan in all test cases that IncSpan can handle, across various sequence lengths, numbers of sequences, modification ratios, etc., with an average speedup of 3.4 times. We also tested CISpan's performance on databases transformed from 20 consecutive versions of Linux kernel source code. On average, CISpan outperforms the non-incremental CloSpan by a factor of 42.
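
To make the terms in the abstract concrete, the following is a minimal brute-force sketch of what "frequent" and "closed" sequential patterns mean at a minimum support of 2, and of how removing a sequence changes the result. It is not CISpan, IncSpan, or CloSpan: it re-mines the whole database from scratch after each update, which is exactly the cost that incremental mining aims to avoid. The toy database of call sequences and all function names are assumptions made for illustration, and each sequence element is assumed to be a single item rather than an itemset.

```python
from itertools import combinations


def subsequences(seq):
    """Return all distinct non-empty subsequences of seq (exponential; toy sizes only)."""
    found = set()
    for k in range(1, len(seq) + 1):
        for idx in combinations(range(len(seq)), k):
            found.add(tuple(seq[i] for i in idx))
    return found


def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order, not necessarily contiguously."""
    it = iter(seq)
    return all(item in it for item in pattern)


def frequent_patterns(db, min_support):
    """Map each pattern to its support (number of sequences containing it),
    keeping only patterns whose support is at least min_support."""
    counts = {}
    for seq in db:
        for pat in subsequences(seq):
            counts[pat] = counts.get(pat, 0) + 1
    return {p: s for p, s in counts.items() if s >= min_support}


def closed_patterns(frequent):
    """Keep patterns that have no longer super-pattern with the same support."""
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
        if not absorbed:
            closed[p] = s
    return closed


# Hypothetical "version 1" database: one call sequence per function.
db_v1 = [
    ("open", "read", "close"),
    ("open", "read", "write", "close"),
    ("open", "close"),
]
frequent_v1 = frequent_patterns(db_v1, min_support=2)
closed_v1 = closed_patterns(frequent_v1)

# Hypothetical "version 2": the third sequence was removed from the code base.
db_v2 = db_v1[:2]
closed_v2 = closed_patterns(frequent_patterns(db_v2, min_support=2))

print(len(frequent_v1), closed_v1, closed_v2)
```

In this toy example the seven frequent patterns of version 1 collapse to two closed patterns, and removing one sequence leaves a single closed pattern. The closed set is the compact representation, and a removal alone is enough to change it, which is the kind of update the abstract says IncSpan cannot handle.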

[1] Ramakrishnan Srikant, et al. Mining sequential patterns, 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[2] Ramakrishnan Srikant, et al. Mining Sequential Patterns: Generalizations and Performance Improvements, 1996, EDBT.

[3] Srinivasan Parthasarathy, et al. Incremental and interactive sequence mining, 1999, CIKM '99.

[4] Qiming Chen, et al. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, 2001, Proceedings 17th International Conference on Data Engineering.

[5] David Wai-Lok Cheung, et al. Efficient Algorithms for Incremental Update of Frequent Sequences, 2002, PAKDD.

[6] Xifeng Yan, et al. CloSpan: Mining Closed Sequential Patterns in Large Datasets, 2003, SDM.

[7] Maguelonne Teisseire, et al. Incremental mining of sequential patterns in large databases, 2003, Data Knowl. Eng.

[8] Yuanyuan Zhou, et al. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, 2004, OSDI.

[9] Mohammed J. Zaki, et al. SPADE: An Efficient Algorithm for Mining Frequent Sequences, 2004, Machine Learning.

[10] Jiawei Han, et al. IncSpan: incremental mining of sequential patterns in large database, 2004, KDD.

[11] Zhenmin Li, et al. PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code, 2005, ESEC/FSE-13.

[12] Benjamin Livshits, et al. DynaMine: finding common error patterns by mining software revision histories, 2005, ESEC/FSE-13.

[13] Jian Pei, et al. MAPO: mining API usages from open source repositories, 2006, MSR '06.

[14] Jian Pei, et al. Mining API patterns as partial orders from source code: from usage scenarios to specifications, 2007, ESEC-FSE '07.

[15] Shiwei Tang, et al. IMCS: Incremental Mining of Closed Sequential Patterns, 2007, APWeb/WAIM.

[16] Xiao Ma, et al. MUVI: automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs, 2007, SOSP.

[17] Chao Liu, et al. Efficient mining of iterative patterns for software specification discovery, 2007, KDD '07.