Mining Contiguous Sequential Generators in Biological Sequences

Discovering conserved sequential patterns in biological sequences is essential to uncovering shared functions. Mining sequential generators, like mining closed sequential patterns, yields a more concise result set than mining all sequential patterns, which matters especially for large-scale bioinformatics data. Previous studies have also argued convincingly that generators are preferable to closed patterns for inductive inference and classification. However, classic sequential generator mining algorithms apply the lower-closure constraint without a contiguity constraint, so they still produce a large number of redundant patterns, too many to use effectively. Motivated by the many applications of contiguous patterns, we propose ConSgen, an efficient algorithm for discovering contiguous sequential generators. It uses an n-gram model, whose n-grams are called shingles, to generate candidate frequent subsequences and applies several pruning techniques to cut unpromising parts of the search space. The contiguous sequential generators are then identified with an equivalence-class-based lower-closure checking scheme. Our experiments on both DNA and protein data sets demonstrate the compactness, efficiency, and scalability of ConSgen.
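
The idea of a contiguous sequential generator can be illustrated with a minimal Python sketch. This is not the ConSgen implementation: the function names, the simple Apriori-style pruning, and the direct comparison against the two length-(n-1) contiguous sub-patterns are assumptions made for illustration. Support is taken to be the number of sequences that contain the pattern as a substring, and length-1 patterns are treated as generators (the empty pattern is ignored).

```python
from collections import defaultdict


def shingles(seq, n):
    """Return the set of distinct contiguous n-grams (shingles) of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def contiguous_generators(db, min_sup):
    """Hypothetical level-wise miner for contiguous sequential generators.

    A frequent contiguous pattern is reported as a generator when both of
    its length-(n-1) contiguous sub-patterns (prefix and suffix) have
    strictly larger support. Because support is anti-monotone, this implies
    that every proper contiguous sub-pattern has larger support.
    """
    support = {}      # frequent contiguous patterns found so far
    generators = {}   # subset of `support` that passes the generator check
    n = 1
    while True:
        counts = defaultdict(int)
        for seq in db:
            for g in shingles(seq, n):
                # Apriori-style pruning: skip candidates whose prefix or
                # suffix of length n-1 is not frequent.
                if n == 1 or (g[:-1] in support and g[1:] in support):
                    counts[g] += 1
        frequent = {g: c for g, c in counts.items() if c >= min_sup}
        if not frequent:
            break
        support.update(frequent)
        for g, c in frequent.items():
            # Lower-closure check against the two immediate sub-patterns.
            if n == 1 or (support[g[:-1]] > c and support[g[1:]] > c):
                generators[g] = c
        n += 1
    return generators


if __name__ == "__main__":
    toy_db = ["ACGT", "ACGG", "AT", "CA"]
    print(contiguous_generators(toy_db, min_sup=2))
```

On this toy database, `AC` is a generator because it occurs in two sequences while `A` and `C` occur in four and three. `CG` is not, because `G` already has the same support of two.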
