A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences

Order-preserving submatrices (OPSMs) are an important unsupervised learning model that has been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems. Unfortunately, mining OPSMs is an NP-complete problem, and most existing methods are heuristic algorithms that cannot reveal all OPSMs. In particular, deep OPSMs, which correspond to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method that discovers all OPSMs based on frequent sequential pattern mining. First, an existing algorithm is adapted to find all common subsequences (ACS) between every pair of row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure is used to store and traverse the ACS, and the Apriori principle is employed to mine the frequent sequential patterns efficiently. Finally, experiments are conducted on gene and synthetic datasets. The results demonstrate the effectiveness and efficiency of the method.
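The pipeline described above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names (`row_to_sequence`, `all_common_subsequences`, `mine_opsms`) and the thresholds `min_rows` and `min_cols` are hypothetical, the common subsequences are enumerated by brute force rather than by the adapted ACS algorithm, and a direct support count stands in for the prefix tree and Apriori pruning. It assumes each row is turned into the sequence of its column indices sorted by ascending value.

```python
from itertools import combinations


def row_to_sequence(row):
    """Column indices ordered by ascending value (ties broken by index)."""
    return tuple(sorted(range(len(row)), key=lambda j: (row[j], j)))


def all_common_subsequences(a, b, min_len=2):
    """Brute-force set of common subsequences of two column orderings.

    Assumes every column index appears at most once in a and in b.
    Worst-case exponential; meant only for tiny illustrative inputs.
    """
    pos_b = {col: i for i, col in enumerate(b)}
    found = set()

    def extend(prefix, start, last_b):
        for i in range(start, len(a)):
            p = pos_b.get(a[i])
            # Keep a[i] only if it also comes later in b than the prefix.
            if p is not None and p > last_b:
                seq = prefix + (a[i],)
                if len(seq) >= min_len:
                    found.add(seq)
                extend(seq, i + 1, p)

    extend((), 0, -1)
    return found


def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order (not necessarily contiguous)."""
    it = iter(seq)
    return all(col in it for col in pattern)


def mine_opsms(matrix, min_rows=2, min_cols=2):
    """Map each column pattern to the rows that preserve its order."""
    seqs = [row_to_sequence(r) for r in matrix]
    # Any pattern supported by at least two rows is a common subsequence
    # of some pair of rows, so pairwise ACS yields every candidate.
    candidates = set()
    for s, t in combinations(seqs, 2):
        candidates |= all_common_subsequences(s, t, min_cols)
    opsms = {}
    for pattern in candidates:
        rows = [i for i, s in enumerate(seqs) if is_subsequence(pattern, s)]
        if len(rows) >= min_rows:
            opsms[pattern] = rows
    return opsms


if __name__ == "__main__":
    matrix = [
        [1, 2, 3, 4],
        [10, 20, 30, 5],
        [4, 3, 2, 1],
    ]
    for pattern, rows in sorted(mine_opsms(matrix).items()):
        print(pattern, rows)
```

In this toy matrix, rows 0 and 1 both increase across columns 0, 1, 2, so the pattern `(0, 1, 2)` is reported with supporting rows `[0, 1]`. Because candidates come from every pair of rows rather than from a global frequency threshold, a long pattern shared by only two rows is still found, which is the property the abstract attributes to the ACS-based step.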
