Mining closed and multi-supports-based sequential pattern in high-dimensional dataset

Existing mining algorithms for high-dimensional datasets, such as biological datasets, produce very large pattern sets that include many short and discontinuous sequential patterns. These patterns carry little useful information. Mining sequential patterns in such sequences requires considering different forms of patterns, such as contiguous patterns and local patterns that appear more than once within a single sequence. Mining closed patterns yields not only a more compact result set but also better efficiency. In this paper, a novel algorithm based on bi-directional extension and multi-supports is proposed specifically for mining contiguous closed patterns in high-dimensional datasets. Three kinds of contiguous closed sequential patterns are mined: sequential patterns, local sequential patterns, and total sequential patterns. Extensive performance experiments on biological sequences demonstrate that the proposed algorithm reduces memory consumption and generates compact pattern sets. A detailed analysis of the multi-supports-based results is also provided.
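To make the notions of multiple supports and contiguous closed patterns concrete, the following is a minimal brute-force sketch, not the algorithm proposed in the paper. The abstract does not define the three supports, so the definitions used here are assumptions: sequence support is the number of sequences containing a pattern, local support is the largest number of occurrences within any single sequence, and total support is the number of occurrences across all sequences. The function names and the `min_len`/`max_len` parameters are hypothetical.

```python
from collections import defaultdict


def contiguous_supports(sequences, min_len=2, max_len=4):
    """Count three assumed kinds of support for every contiguous substring."""
    seq_support = defaultdict(int)    # sequences that contain the pattern
    local_support = defaultdict(int)  # max occurrences inside one sequence
    total_support = defaultdict(int)  # occurrences summed over all sequences
    for seq in sequences:
        counts = defaultdict(int)
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                counts[seq[start:start + length]] += 1
        for pat, n in counts.items():
            seq_support[pat] += 1
            total_support[pat] += n
            local_support[pat] = max(local_support[pat], n)
    return seq_support, local_support, total_support


def closed_contiguous(support, min_sup):
    """Keep frequent patterns that no one-symbol left or right extension
    matches in support (a simple check in both directions)."""
    frequent = {p: s for p, s in support.items() if s >= min_sup}
    closed = {}
    for pat, sup in frequent.items():
        absorbed = any(
            s == sup
            and len(q) == len(pat) + 1
            and (q.startswith(pat) or q.endswith(pat))
            for q, s in frequent.items()
        )
        if not absorbed:
            closed[pat] = sup
    return closed


if __name__ == "__main__":
    # Toy DNA-like input, invented for illustration only.
    data = ["ACGTACGT", "ACGTTT", "GGACG"]
    seq_sup, local_sup, total_sup = contiguous_supports(data)
    print(closed_contiguous(seq_sup, min_sup=2))
```

Checking only one-symbol extensions is enough for contiguous patterns, because any longer contiguous super-pattern with equal support implies a one-symbol extension with the same support. Patterns of length `max_len` are reported as closed only because longer extensions are never enumerated in this sketch.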
