Paper information - Supporting bi-cluster interpretation in 0/1 data by means of local patterns

Clustering or co-clustering techniques have proven useful in many application domains. A weakness of these techniques remains their poor support for characterizing the resulting groups. As a result, interpreting clustering results and discovering knowledge from them can be quite hard. We consider potentially large Boolean data sets that record properties of objects, and we assume the availability of a bi-partition that has to be characterized by means of a symbolic description. Our generic approach exploits collections of local patterns that satisfy user-defined constraints in the data, together with a measure of the accuracy of a given local pattern as a bi-cluster characterization pattern. We consider local patterns that are bi-sets, i.e., sets of objects associated with sets of properties. Two concrete examples are formal concepts (i.e., associated closed sets) and the so-called δ-bi-sets (i.e., an extension of formal concepts towards fault tolerance). We introduce the idea of a characterizing query, which experts can use to support knowledge discovery from bi-partitions thanks to the available local patterns. The added value is illustrated on benchmark data and three real data sets: a medical data set and two gene expression data sets.
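
To make the notion of a formal concept concrete, the following is a minimal Python sketch that enumerates all formal concepts of a tiny Boolean data set by brute force. The data set, the object and property names, and the function names are illustrative assumptions and are not taken from the paper. The brute-force enumeration is not the constraint-based mining approach the authors rely on and is only practical for toy inputs.

```python
from itertools import combinations

# Hypothetical toy Boolean data set: object -> set of properties it has.
DATA = {
    "o1": {"p1", "p2"},
    "o2": {"p1", "p2", "p3"},
    "o3": {"p3"},
}
ALL_PROPERTIES = set().union(*DATA.values())


def common_properties(objects):
    """Properties shared by every object in `objects` (all properties if empty)."""
    shared = set(ALL_PROPERTIES)
    for obj in objects:
        shared &= DATA[obj]
    return shared


def supporting_objects(properties):
    """Objects that have every property in `properties`."""
    return {obj for obj, props in DATA.items() if properties <= props}


def formal_concepts():
    """Enumerate formal concepts by closing every subset of objects (brute force)."""
    concepts = set()
    names = sorted(DATA)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            props = common_properties(subset)
            objs = supporting_objects(props)
            # (objs, props) is a bi-set in which both sides are closed.
            concepts.add((frozenset(objs), frozenset(props)))
    return concepts


if __name__ == "__main__":
    for objs, props in sorted(formal_concepts(), key=lambda c: len(c[0])):
        print(sorted(objs), sorted(props))
```

Each resulting pair is a bi-set whose object set and property set determine each other: the properties are exactly those shared by all the objects, and the objects are exactly those having all the properties.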

Ruggero G. Pensa | Jean-François Boulicaut | Céline Robardet
