Sequence Classification Based on Delta-Free Sequential Patterns

Sequential pattern mining is one of the most studied and challenging tasks in data mining. However, extending well-known methods designed for other classical pattern types to sequences is not trivial. In this paper we study the notion of δ-freeness for sequences. While this notion has been extensively discussed for itemsets, this work is the first to extend it to sequences. We define an efficient algorithm for extracting δ-free sequential patterns. Furthermore, we show the advantage of δ-free sequences and highlight their importance when building sequence classifiers. We show how they can be used to address the feature selection problem in statistical classifiers, as well as to build symbolic classifiers that optimize both the accuracy and the earliness of predictions.
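The abstract does not state the definition of δ-freeness for sequences, so the following is only an illustrative sketch. It assumes the natural analogue of the itemset definition: a sequence is δ-free if every proper subsequence has a support that exceeds its own by more than δ. The function names (`is_subsequence`, `support`, `is_delta_free`), the restriction to sequences of single items, and the toy database are all assumptions made for illustration, not the paper's algorithm.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # 'item in it' consumes the iterator, which enforces the ordering.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def is_delta_free(pattern, database, delta):
    """Check delta-freeness under the ASSUMED definition stated above.

    Support is anti-monotone, so the proper subsequences with the lowest
    support are those obtained by deleting exactly one item; checking
    them is enough to cover all proper subsequences.
    """
    sup = support(pattern, database)
    for i in range(len(pattern)):
        sub = pattern[:i] + pattern[i + 1:]
        if support(sub, database) - sup <= delta:
            return False
    return True


# Hypothetical toy database: each string is a sequence of single items.
database = ["abc", "abd", "ab", "b"]
print(is_delta_free("a", database, 0))
print(is_delta_free("ab", database, 0))
```

In this toy database, `"ab"` occurs in exactly the same sequences as `"a"`, so under the assumed definition it adds no information beyond `"a"` and is not 0-free. This is a brute-force check for intuition only; an efficient miner would prune the search space rather than test patterns one at a time.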
