Top-k Self-Adaptive Contrast Sequential Pattern Mining

In sequence classification, a central task is finding discriminative features, and sequential pattern mining (SPM) is often used to extract frequent patterns from sequences as features. To improve classification accuracy and pattern interpretability, contrast pattern mining discovers patterns with high contrast rates between different categories. Existing contrast SPM methods still face challenges, including the need to select many parameters and inefficient occurrence counting. To tackle these issues, this article proposes top-$k$ self-adaptive contrast SPM, which adaptively adjusts the gap constraints to find the top-$k$ self-adaptive contrast patterns (SCPs) from positive and negative sequences. A key task in this mining problem is calculating the support (the number of occurrences) of a pattern in each sequence. To support efficient counting, we store all occurrences of a pattern in a special array in a Nettree, an extended tree structure with multiple roots and multiple parents. We use this array to calculate the occurrences of all of the pattern's superpatterns with one-way scanning, which avoids redundant calculation. Because the contrast SPM problem does not satisfy the Apriori property, we propose Zero and Less strategies to prune candidate patterns, and a Contrast-first mining strategy that selects the patterns with the highest contrast rate as prefix subpatterns and calculates the contrast rates of all their superpatterns. Experiments validate the efficiency of the proposed algorithm and show that contrast patterns significantly outperform frequent patterns for sequence classification. The algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/SCP-Miner.
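
To make the notions of support, contrast rate, and top-$k$ selection concrete, the following is a minimal brute-force sketch, not the SCP-Miner algorithm. It rests on several assumptions that the abstract does not confirm: support is counted greedily from left to right with each sequence position used at most once and no explicit gap bounds, the contrast rate is taken as the mean support per positive sequence minus the mean support per negative sequence, and candidates are enumerated exhaustively up to a hypothetical `max_len` instead of being pruned with the Zero, Less, and Contrast-first strategies. The names `support`, `contrast_rate`, and `top_k_contrast` are illustrative only.

```python
import heapq
from itertools import product


def support(pattern, sequence):
    """Count occurrences of `pattern` in `sequence` (simplified assumption).

    Occurrences are matched greedily from left to right, and each position
    of the sequence may be used by at most one occurrence. This stands in
    for the Nettree-based counting and is not the paper's exact definition.
    """
    if not pattern:
        return 0
    used = [False] * len(sequence)
    count = 0
    while True:
        prev = -1      # index matched for the previous pattern character
        picked = []    # indices that form the current occurrence
        for ch in pattern:
            nxt = next(
                (i for i in range(prev + 1, len(sequence))
                 if not used[i] and sequence[i] == ch),
                None,
            )
            if nxt is None:
                return count  # no further occurrence can be completed
            picked.append(nxt)
            prev = nxt
        for i in picked:
            used[i] = True
        count += 1


def contrast_rate(pattern, positives, negatives):
    """Assumed contrast rate: mean positive support minus mean negative support."""
    pos = sum(support(pattern, s) for s in positives) / len(positives)
    neg = sum(support(pattern, s) for s in negatives) / len(negatives)
    return pos - neg


def top_k_contrast(positives, negatives, k, max_len=3):
    """Brute-force top-k: score every candidate up to `max_len` characters."""
    alphabet = sorted(set("".join(positives + negatives)))
    candidates = (
        "".join(chars)
        for length in range(1, max_len + 1)
        for chars in product(alphabet, repeat=length)
    )
    scored = ((contrast_rate(c, positives, negatives), c) for c in candidates)
    return heapq.nlargest(k, scored)  # list of (rate, pattern), best first


if __name__ == "__main__":
    positives = ["abab", "aabb"]
    negatives = ["bbaa", "ba"]
    for rate, pattern in top_k_contrast(positives, negatives, k=3, max_len=2):
        print(pattern, rate)
```

The exhaustive enumeration grows exponentially with `max_len`, which is the cost that a pruning strategy and the reuse of stored occurrences for superpatterns are meant to avoid.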
