Novel techniques and an efficient algorithm for closed pattern mining

Abstract In this paper we show that frequent closed itemset mining and biclustering, the two most prominent application fields in pattern discovery, reduce to the same problem when the data are binary (0–1). We then introduce FCPMiner, a powerful new pattern mining method that mines such data efficiently. A distinguishing feature of the proposed method is that it can be extended to non-binary data. The mining method is coupled with a novel visualization technique and a pattern aggregation method to detect the most meaningful, non-overlapping patterns. The proposed methods are rigorously tested on both synthetic and real data sets.
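The claimed equivalence on binary data can be illustrated with a small brute-force sketch. It is not FCPMiner and is not taken from the paper. It assumes the usual textbook definitions: an itemset is closed when it equals the set of all columns shared by its supporting rows, and a bicluster is an inclusion-maximal all-ones submatrix. The function names and the toy matrix are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def closed_itemsets(rows, min_support=1):
    """Brute force: map each closed itemset (column set) to its supporting rows.

    Assumes min_support >= 1 so that every kept itemset has a non-empty support.
    """
    n_cols = len(rows[0])
    result = {}
    for k in range(1, n_cols + 1):
        for items in combinations(range(n_cols), k):
            # Rows that contain every item of the candidate itemset.
            support = frozenset(
                i for i, row in enumerate(rows) if all(row[j] for j in items)
            )
            if len(support) < min_support:
                continue
            # Closure: every column that is 1 in all supporting rows.
            closure = frozenset(
                j for j in range(n_cols) if all(rows[i][j] for i in support)
            )
            if closure == frozenset(items):
                result[closure] = support
    return result


def maximal_ones_biclusters(rows):
    """Brute force: all inclusion-maximal all-ones submatrices as (row set, column set)."""
    n_rows, n_cols = len(rows), len(rows[0])
    found = set()
    for r in range(1, n_rows + 1):
        for row_set in combinations(range(n_rows), r):
            # Largest column set that is all ones on the chosen rows.
            cols = frozenset(
                j for j in range(n_cols) if all(rows[i][j] for i in row_set)
            )
            if not cols:
                continue
            # Largest row set that is all ones on those columns.
            full_rows = frozenset(
                i for i in range(n_rows) if all(rows[i][j] for j in cols)
            )
            # Maximal only if no further row can be added.
            if full_rows == frozenset(row_set):
                found.add((full_rows, cols))
    return found


# Toy 0-1 matrix: rows are transactions (or genes), columns are items (or conditions).
toy = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

closed = closed_itemsets(toy)
from_itemsets = {(support, items) for items, support in closed.items()}
biclusters = maximal_ones_biclusters(toy)
print(from_itemsets == biclusters)
```

On this toy matrix both enumerations yield the same four (row set, column set) pairs. Both functions are exponential in the matrix size and serve only to make the correspondence concrete on small inputs.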
