An Effective Pattern Pruning and Summarization Method Retaining High Quality Patterns With High Area Coverage in Relational Datasets

Pattern mining is widely used to uncover interesting patterns in data. However, one of its main problems is that it produces too many patterns, many of which are redundant. Delta-closed pattern pruning was introduced to reduce the number of redundant patterns while retaining overlapping ones, yet it can only prune subpatterns that are covered by superpatterns; it cannot prune superpatterns that are statistically induced by strong subpatterns, and such superpatterns also need to be pruned. Furthermore, pattern summarization has been proposed to improve the management and interpretation of patterns: it renders a small number of patterns that retain the most crucial information. The RuleCover algorithm is one such method. However, it tends to produce overly trivial patterns, whereas more interesting and revealing ones may be pruned. To overcome these problems, this paper presents a new algorithm that integrates the delta-closed and RuleCover methods with two new algorithms: 1) statistically induced pattern pruning, which prunes superpatterns that are statistically induced by strong subpatterns, and 2) the AreaCover algorithm, which prunes overlapping patterns while retaining higher-order, high-quality patterns that cover a large “area” of the data. Experimental results show that the proposed algorithms produce very compact yet comprehensive knowledge from patterns discovered in relational datasets.
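
To make the notion of covering the data "area" concrete, the following is a minimal sketch of a greedy coverage heuristic. It is not the AreaCover algorithm of the paper, whose details are not given in this abstract. It assumes that a pattern is a set of (attribute, value) pairs, that the "area" of a pattern is the set of (row, attribute) cells it matches in the relational table, and that patterns are picked by how many not-yet-covered cells they add. The names `covered_cells`, `greedy_area_cover`, and `min_gain`, as well as the toy table, are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Row = Dict[str, str]
Pattern = FrozenSet[Tuple[str, str]]   # set of (attribute, value) pairs
Cell = Tuple[int, str]                 # (row index, attribute name)


def covered_cells(pattern: Pattern, rows: List[Row]) -> Set[Cell]:
    """Cells of the table matched by the pattern (its assumed 'area')."""
    cells: Set[Cell] = set()
    for i, row in enumerate(rows):
        if all(row.get(attr) == value for attr, value in pattern):
            cells.update((i, attr) for attr, _ in pattern)
    return cells


def greedy_area_cover(
    patterns: List[Pattern], rows: List[Row], min_gain: int = 1
) -> Tuple[List[Pattern], Set[Cell]]:
    """Repeatedly keep the pattern adding the most uncovered cells.

    Stops when no remaining pattern adds at least `min_gain` new cells
    (an illustrative stopping rule, not taken from the paper).
    """
    remaining = list(patterns)
    covered: Set[Cell] = set()
    selected: List[Pattern] = []
    while remaining:
        best = max(remaining, key=lambda p: len(covered_cells(p, rows) - covered))
        gain = len(covered_cells(best, rows) - covered)
        if gain < min_gain:
            break
        selected.append(best)
        covered |= covered_cells(best, rows)
        remaining.remove(best)
    return selected, covered


# Toy relational table and candidate patterns (hypothetical data).
rows = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c2"},
    {"A": "a2", "B": "b2", "C": "c2"},
]
p_a = frozenset({("A", "a1")})
p_ab = frozenset({("A", "a1"), ("B", "b1")})
p_abc = frozenset({("A", "a1"), ("B", "b1"), ("C", "c1")})
p_c2 = frozenset({("C", "c2")})

selected, covered = greedy_area_cover([p_a, p_ab, p_abc, p_c2], rows)
print(len(selected), len(covered))
```

In this toy table the low-order pattern `p_a` is never selected, because every cell it matches is already covered by the higher-order pattern `p_ab`. This mirrors, in a simplified way, the stated goal of dropping overlapping patterns while keeping higher-order ones with large coverage.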
