Constraint-Based Measures for DNA Sequence Mining using Group Search Optimization Algorithm

In this paper, we propose a three-step DNA sequence mining algorithm, called 3s-DNASM, that incorporates PrefixSpan, length and width constraints, and group search optimization (GSO). The mining process comprises three steps: 1) applying the PrefixSpan algorithm, 2) applying length and width constraints, and 3) optimal mining via GSO. We first present the concept of PrefixSpan, which detects frequent DNA sequences. Based on the resulting prefix tree, length and width constraints are applied to handle restrictions. Finally, we adopt the GSO algorithm to ensure the completeness of the mining result. Experiments on a DNA sequence dataset showed that the 3s-DNASM system is effective for sequence mining. The simulation results show that, at min_support=4, 3s-DNASM mined only 29 patterns, whereas PrefixSpan alone mined about 2168 patterns.
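To make the first two steps concrete, the following is a minimal sketch of PrefixSpan-style pattern growth over DNA strings, with a maximum pattern length applied during growth. It is an illustration under stated assumptions, not the 3s-DNASM implementation: the function name `prefixspan`, the `max_length` parameter, and the toy sequences are hypothetical, support is assumed to mean the number of sequences containing a pattern as a (possibly gapped) subsequence, and the width constraint and the GSO step are not modeled because the abstract does not define them.

```python
from collections import defaultdict


def prefixspan(sequences, min_support, max_length=None):
    """Return {pattern: support} for frequent subsequences (gaps allowed).

    Support is the number of input sequences that contain the pattern.
    max_length is an assumed stand-in for a length constraint.
    """
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence index, offset where the suffix starts),
        # with at most one entry per sequence.
        if max_length is not None and len(prefix) >= max_length:
            return  # length constraint: do not extend the pattern further
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                symbol = seq[pos]
                if symbol not in seen:
                    # Keep only the first occurrence so each sequence
                    # contributes once to the support of prefix + symbol.
                    seen.add(symbol)
                    occurrences[symbol].append((seq_id, pos + 1))
        for symbol, new_projected in sorted(occurrences.items()):
            if len(new_projected) >= min_support:
                pattern = prefix + symbol
                results[pattern] = len(new_projected)
                grow(pattern, new_projected)

    grow("", [(i, 0) for i in range(len(sequences))])
    return results


if __name__ == "__main__":
    toy = ["ACGT", "ACT", "AGT", "CGA"]  # hypothetical toy data
    print(prefixspan(toy, min_support=3))
    # {'A': 4, 'AT': 3, 'C': 3, 'G': 3, 'T': 3}
    print(prefixspan(toy, min_support=3, max_length=1))
    # {'A': 4, 'C': 3, 'G': 3, 'T': 3}
```

Applying the length limit inside the recursion, rather than filtering afterwards, prunes the search early, which is one plausible reason a constrained miner reports far fewer patterns than unconstrained PrefixSpan.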
