MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

Consecutive pattern mining, which aims to find frequent substrings of a sequence, is a special case of frequent pattern mining and has played a crucial role in many real-world applications, especially biological sequence analysis, time series analysis, and network log mining. Approximate matching that tolerates insertions, deletions, and substitutions between strings is widely used in biological sequence comparison. However, most existing string pattern mining methods consider only Hamming distance, which does not allow insertions or deletions (indels). Little attention has been paid to general approximate consecutive frequent pattern mining under edit distance, potentially because of its high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem, which identifies substring patterns under edit distance in a long query sequence. We then propose a novel algorithm with linear time complexity that checks whether the support of a substring pattern in the query sequence is above a predefined threshold, greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem using several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study of a cancer genomics application demonstrate the effectiveness and efficiency of our algorithm compared with several existing methods.
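To make the decision problem concrete, the following is a minimal Python sketch of a support check under edit distance. It is not the linear-time algorithm proposed in the paper; it uses the classical dynamic-programming recurrence for approximate substring matching (often attributed to Sellers), which takes O(mn) time for a pattern of length m and a sequence of length n. The function names `approx_end_positions` and `support_at_least` are hypothetical, and the notion of support used here (the number of end positions in the sequence at which some substring lies within edit distance k of the pattern) is an assumption made for illustration; the paper's formal definition may differ.

```python
def approx_end_positions(pattern, text, k):
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Baseline O(len(pattern) * len(text)) dynamic program; NOT the paper's
    linear-time method.
    """
    m = len(pattern)
    # prev[i] = edit distance between pattern[:i] and the empty substring.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        # curr[0] = 0 because a match may start at any position in text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in text
                curr[i - 1] + 1,     # pattern character left unmatched
            )
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends


def support_at_least(pattern, text, k, min_support):
    """Decide whether the (assumed) support reaches `min_support`.

    Assumption: support = number of qualifying end positions.
    """
    return len(approx_end_positions(pattern, text, k)) >= min_support


if __name__ == "__main__":
    print(approx_end_positions("abc", "xxabdxx", 1))
    print(support_at_least("abc", "xxabdxx", 1, 2))
```

Counting end positions tends to overcount, since neighboring positions often correspond to the same underlying occurrence; a stricter definition (for example, non-overlapping occurrences) would need an extra selection step on top of this sketch.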
