Reasoning about sets using redescription mining

Redescription mining is a recently introduced data mining problem that seeks subsets of data that afford multiple definitions. It can be viewed as a generalization of association rule mining, from finding implications to finding equivalences; as a form of conceptual clustering, where the goal is to identify clusters that afford dual characterizations; and as a form of constructive induction, where features are built from the given descriptors so that they mutually reinforce each other. In this paper, we present redescription mining as a tool for reasoning about a collection of sets, especially their overlaps, similarities, and differences. We outline algorithms that mine all minimal (non-redundant) redescriptions underlying a dataset using the notion of minimal generators of closed itemsets. We also show how these algorithms can be used interactively, supporting constraint-based exploration and querying. Specifically, we showcase a bioinformatics application that lets the biologist define a vocabulary of sets over a domain of genes and reason about these sets, yielding significant biological insight.
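
To make the link between redescriptions and minimal generators of closed itemsets concrete, the following is a minimal brute-force sketch. It is not the algorithm outlined in the paper. It assumes exact redescriptions only (both sides cover exactly the same objects) and conjunctive descriptions only, and it ignores the empty itemset. The toy data and the names `DATA`, `extent`, `closure`, `minimal_generators`, and `redescriptions` are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each object (e.g. a gene) maps to the
# descriptors (named sets) it belongs to.
DATA = {
    "g1": {"A", "B", "C"},
    "g2": {"A", "B", "C"},
    "g3": {"A", "D"},
    "g4": {"D"},
}


def extent(itemset, data):
    """Objects that carry every descriptor in `itemset`."""
    return frozenset(o for o, items in data.items() if itemset <= items)


def closure(itemset, data):
    """Largest descriptor set shared by all objects in the extent.

    Assumes the extent of `itemset` is non-empty.
    """
    objs = extent(itemset, data)
    return frozenset(set.intersection(*(data[o] for o in objs)))


def minimal_generators(data):
    """Map each closed itemset to its minimal generators (brute force)."""
    items = sorted(set().union(*data.values()))
    result = {}
    # Enumerate candidates by increasing size so that smaller
    # generators are always recorded before their supersets.
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if not extent(candidate, data):
                continue  # describes no object at all
            gens = result.setdefault(closure(candidate, data), [])
            # Keep the candidate only if no recorded generator of the
            # same closed set is a proper subset of it.
            if not any(g < candidate for g in gens):
                gens.append(candidate)
    return result


def redescriptions(data):
    """Pairs of distinct minimal generators with identical extents."""
    found = []
    for gens in minimal_generators(data).values():
        for left, right in combinations(gens, 2):
            found.append((left, right, extent(left, data)))
    return found


if __name__ == "__main__":
    for left, right, objs in redescriptions(DATA):
        print(sorted(left), "<=>", sorted(right), "on", sorted(objs))
```

In this toy dataset the descriptors `B` and `C` cover exactly the same objects, so they are two minimal generators of the same closed itemset and each redescribes the other. Enumerating every candidate itemset is exponential in the number of descriptors, so this sketch is only suitable for very small examples.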
