Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences

Discovering patterns in sequence data has significant impact in many areas of science and society, especially genomics and proteomics. Here we take multiple strings as the input sequence data and substrings as patterns. In practice, a large set of patterns can usually be discovered, yet many of them are redundant, which degrades output quality. This paper improves output quality by removing two types of redundant patterns. First, the notion of the delta-tolerance closed itemset is used to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns that appear statistically significant only because their significance is induced by strongly significant subpatterns. Mining these nonredundant patterns (delta closed patterns and noninduced patterns) is computationally intensive. To discover them efficiently in very large sequence data, two algorithms have been developed through innovative use of the suffix tree. Three sets of experiments were conducted to evaluate their performance, and the algorithms render excellent results when applied to genomic data. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns that reveal interesting information in the sequences.
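To make the idea of delta closed patterns concrete, the following is a minimal brute-force sketch in Python, not the suffix-tree algorithms described above. It assumes a reading of delta closedness borrowed from delta-tolerance closed itemsets: a substring with occurrence count c is kept only if no longer substring containing it has a count of at least (1 - delta) * c. The function names, the use of occurrence counts as support, and the parameters `min_count`, `delta`, and `max_len` are all illustrative assumptions.

```python
from collections import Counter


def substring_counts(sequences, max_len):
    """Count every occurrence of every substring of length <= max_len."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, min(i + max_len, len(seq)) + 1):
                counts[seq[i:j]] += 1
    return counts


def delta_closed_patterns(sequences, min_count=2, delta=0.0, max_len=8):
    """Return {pattern: count} for substrings judged delta closed.

    Assumed definition (hypothetical, for illustration): a pattern with
    count c is delta closed if every superpattern has count below
    (1 - delta) * c, with 0 <= delta < 1. Because occurrence counts never
    increase when a substring is extended, it is enough to check the
    one-character extensions on the left and on the right.
    """
    # Count one extra character so extensions of max_len patterns are known.
    counts = substring_counts(sequences, max_len + 1)
    alphabet = sorted(set("".join(sequences)))
    closed = {}
    for pattern, count in counts.items():
        if len(pattern) > max_len or count < min_count:
            continue
        extensions = [pattern + a for a in alphabet]
        extensions += [a + pattern for a in alphabet]
        best = max(counts.get(e, 0) for e in extensions)
        if best < (1 - delta) * count:
            closed[pattern] = count
    return closed


if __name__ == "__main__":
    data = ["ABCD", "ABCE", "XBCD"]
    print(delta_closed_patterns(data, delta=0.0))
    print(delta_closed_patterns(data, delta=0.4))
```

With `delta=0.0` the sketch keeps ordinary closed substrings; raising `delta` also discards patterns whose best extension occurs almost as often, which is the redundancy reduction the abstract refers to. The brute-force enumeration is quadratic in sequence length and is only meant for small inputs.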
