Using Classification and Visualization on Pattern Databases for Gene Expression Data Analysis

We are designing new data mining techniques for gene expression data, more precisely inductive querying techniques that extract a priori interesting bi-sets, i.e., sets of objects (or biological situations) and associated sets of attributes (or genes). The so-called (formal) concepts are important special cases of a priori interesting bi-sets in derived boolean expression matrices, e.g., matrices that encode over-expression of genes. It has been shown recently that extracting every concept is often possible from typical gene expression data because the number of biological situations is generally quite small (a few tens). In specific applications, we have been able to extract every concept, and this can yield millions of concepts. Obviously, post-processing such huge volumes of patterns to discover biologically relevant information is challenging. The effort is nevertheless worthwhile, since the added value of transcription module discovery is very high and formal concepts can be seen as putative transcription modules. We describe our ongoing research on concept post-processing by means of classification and visualization. It has been applied to a real-life gene expression data set, with promising feedback from end-users.
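To make the notion of a formal concept concrete, the following is a minimal sketch on a toy data set. It assumes a hypothetical expression table `expr`, an arbitrary over-expression threshold of 5.0, and a brute-force enumeration over subsets of situations; none of these come from the work described above, and the enumeration is an illustration, not the extraction algorithm used by the authors. It is only practical because the number of situations is small.

```python
from itertools import combinations

def over_expression_matrix(expr, threshold):
    """Boolean encoding: for each situation, keep the genes whose value >= threshold."""
    return {s: {g for g, v in row.items() if v >= threshold}
            for s, row in expr.items()}

def formal_concepts(context):
    """Enumerate every formal concept (situations, genes) of a boolean context.

    context maps each situation to the set of genes over-expressed in it.
    Brute force over subsets of situations, so the cost is exponential in
    the number of situations.
    """
    situations = sorted(context)
    all_genes = set().union(*context.values()) if context else set()
    concepts = set()
    for k in range(len(situations) + 1):
        for subset in combinations(situations, k):
            # Genes over-expressed in every situation of the subset.
            genes = set(all_genes)
            for s in subset:
                genes &= context[s]
            # Closure: every situation that over-expresses all of those genes.
            extent = frozenset(s for s in situations if genes <= context[s])
            concepts.add((extent, frozenset(genes)))
    return concepts

# Hypothetical toy data: 3 biological situations, 3 genes.
expr = {
    "s1": {"g1": 9.0, "g2": 8.0, "g3": 1.0},
    "s2": {"g1": 7.5, "g2": 6.0, "g3": 5.5},
    "s3": {"g1": 0.5, "g2": 7.0, "g3": 6.5},
}
context = over_expression_matrix(expr, threshold=5.0)
concepts = formal_concepts(context)

for extent, genes in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(genes))
```

Each printed pair is a maximal rectangle of true values in the boolean matrix: no situation or gene can be added without breaking the "all genes over-expressed in all situations" property. On real data the same definition can produce millions of such pairs, which is what motivates the post-processing step.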
