RIn-Close_CVC2: an even more efficient enumerative algorithm for biclustering of numerical datasets

RIn-Close_CVC is an efficient (takes polynomial time per bicluster), complete (finds all maximal biclusters), correct (all biclusters satisfy the user-defined level of consistency) and non-redundant (all the obtained biclusters are maximal and the same bicluster is not enumerated more than once) enumerative algorithm for mining maximal biclusters with constant values on columns in numerical datasets. Although RIn-Close_CVC has all these properties, it has a high computational cost in terms of memory usage, because it must keep a symbol table in memory to prevent a maximal bicluster from being found more than once. In this paper, we propose a new version of RIn-Close_CVC, named RIn-Close_CVC2, that does not use a symbol table to prevent redundant biclusters and still keeps all four properties. We also prove that both algorithms actually possess these properties. Experiments are carried out with synthetic and real-world datasets to compare RIn-Close_CVC and RIn-Close_CVC2 in terms of memory usage and runtime. The experimental results show that RIn-Close_CVC2 brings a large reduction in memory usage and, on average, a significant runtime gain when compared to its predecessor.
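To make the notions of "constant values on columns" and of a symbol table concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the user-defined level of consistency is a tolerance `epsilon` bounding the spread (maximum minus minimum) of the values in each selected column over the selected rows. The names `is_cvc_bicluster`, `record_if_new`, and `epsilon` are hypothetical, and the set-based `seen` table only illustrates the kind of structure whose memory footprint grows with the number of enumerated biclusters.

```python
from typing import FrozenSet, Sequence, Set, Tuple

def is_cvc_bicluster(data: Sequence[Sequence[float]],
                     rows: Sequence[int],
                     cols: Sequence[int],
                     epsilon: float) -> bool:
    """Check the assumed consistency criterion: in every selected column,
    the values over the selected rows span at most epsilon."""
    if not rows or not cols:
        return False
    for j in cols:
        values = [data[i][j] for i in rows]
        if max(values) - min(values) > epsilon:
            return False
    return True

# Hypothetical symbol table: one key per bicluster already reported.
Key = Tuple[FrozenSet[int], FrozenSet[int]]

def record_if_new(seen: Set[Key], rows: Sequence[int], cols: Sequence[int]) -> bool:
    """Return True and store the bicluster if it was not seen before."""
    key = (frozenset(rows), frozenset(cols))
    if key in seen:
        return False
    seen.add(key)
    return True

# Toy dataset (made up for illustration only).
data = [
    [1.0, 5.0, 9.0],
    [1.1, 5.2, 2.0],
    [0.9, 4.9, 9.1],
]

all_rows_two_cols = is_cvc_bicluster(data, [0, 1, 2], [0, 1], epsilon=0.5)
all_rows_all_cols = is_cvc_bicluster(data, [0, 1, 2], [0, 1, 2], epsilon=0.5)
two_rows_all_cols = is_cvc_bicluster(data, [0, 2], [0, 1, 2], epsilon=0.5)

seen: Set[Key] = set()
first_time = record_if_new(seen, [0, 2], [0, 1, 2])
second_time = record_if_new(seen, [2, 0], [2, 1, 0])  # same bicluster, other order
```

In this sketch every reported bicluster leaves an entry in `seen`, so memory grows with the size of the output; this is the kind of cost that an enumeration scheme without a symbol table avoids.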
