Towards Fault-Tolerant Formal Concept Analysis

Given Boolean data sets which record properties of objects, Formal Concept Analysis is a well-known approach for knowledge discovery. Recent application domains, e.g., for very large data sets, have motivated new algorithms which can perform constraint-based mining of formal concepts (i.e., closed sets on both dimensions which are associated by the Galois connection and satisfy some user-defined constraints). In this paper, we consider a major limit of these approaches when considering noisy data sets. This is indeed the case of Boolean gene expression data analysis where objects denote biological experiments and attributes denote gene expression properties. In this type of intrinsically noisy data, the Galois association is so strong that the number of extracted formal concepts explodes. We formalize the computation of the so-called δ-bi-sets as an alternative for capturing strong associations between sets of objects and sets of properties. Based on a previous work on approximate condensed representations of frequent sets by means of δ-free itemsets, we get an efficient technique which can be applied on large data sets. An experimental validation on both synthetic and real data is given. It confirms the added-value of our approach w.r.t. formal concept discovery, i.e., the extraction of smaller collections of relevant associations.

[1]  Douglas H. Fisher,et al.  Knowledge Acquisition Via Incremental Conceptual Clustering , 1987, Machine Learning.

[2]  Heikki Mannila,et al.  Levelwise Search and Borders of Theories in Knowledge Discovery , 1997, Data Mining and Knowledge Discovery.

[3]  Ruggero G. Pensa,et al.  Assessment of discretization techniques for relevant pattern discovery from gene expression data , 2004, BIOKDD.

[4]  Jean-François Boulicaut,et al.  Constraint-Based Mining of Formal Concepts in Transactional Data , 2004, PAKDD.

[5]  M. Eisen,et al.  Why PLoS Became a Publisher , 2003, PLoS biology.

[6]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[7]  Inderjit S. Dhillon,et al.  Information-theoretic co-clustering , 2003, KDD '03.

[8]  Céline Robardet,et al.  Efficient Local Search in Conceptual Clustering , 2001, Discovery Science.

[9]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  Jean-François Boulicaut,et al.  Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries , 2004, Data Mining and Knowledge Discovery.

[11]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[12]  J. Derisi,et al.  The Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum , 2003, PLoS biology.

[13]  Germano Resconi,et al.  A context model for fuzzy concept analysis based upon modal logic , 2004, Inf. Sci..

[14]  Francesco Bonchi,et al.  Knowledge Discovery in Inductive Databases, 4th International Workshop, KDID 2005, Porto, Portugal, October 3, 2005, Revised Selected and Invited Papers , 2006, KDID.

[15]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[16]  Jean-François Boulicaut,et al.  Mining Formal Concepts with a Bounded Number of Exceptions from Transactional Data , 2004, KDID.

[17]  Sergei O. Kuznetsov,et al.  Comparing performance of algorithms for generating concept lattices , 2002, J. Exp. Theor. Artif. Intell..

[18]  Bart Goethals,et al.  Tiling Databases , 2004, Discovery Science.

[19]  Dino Pedreschi,et al.  Knowledge Discovery in Databases: PKDD 2004 , 2004, Lecture Notes in Computer Science.

[20]  Gerd Stumme,et al.  Computing iceberg concept lattices with T , 2002, Data Knowl. Eng..

[21]  E. Salmon Gene Expression During the Life Cycle of Drosophila melanogaster , 2002 .

[22]  Aristides Gionis,et al.  Geometric and Combinatorial Tiles in 0-1 Data , 2004, PKDD.