Assessment of discretization techniques for relevant pattern discovery from gene expression data

In gene expression data analysis, several researchers have recently emphasized the promise of pattern discovery techniques, such as association rule mining or formal concept extraction, applied to boolean matrices that encode gene properties. To get the most from these approaches, a necessary step is the encoding of gene properties (e.g., over-expression), which requires discretizing the raw gene expression data. This preprocessing step has a crucial impact on both the quantity and the relevancy of the extracted patterns. In this paper, we study the impact of discretization parameters through a sound comparison between the dendrograms, i.e., the trees generated by a hierarchical clustering algorithm, computed from the raw expression data and from the various derived boolean matrices. Using a new similarity measure and a practical validation over several gene expression data sets, we propose a method that supports the choice of a discretization technique and its parameters for each specific data set.
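
To make the discretization step concrete, the following minimal sketch shows one way a raw expression matrix could be turned into a boolean matrix that encodes over-expression. The abstract does not name the discretization techniques under study, so the rule used here (a value counts as over-expressed when it reaches a given fraction of its gene's maximum) is an assumption, and the names `discretize_max_fraction`, `density`, and `fraction` are hypothetical. It only illustrates how a single parameter changes the derived boolean matrix, and hence the patterns that can be extracted from it.

```python
from typing import List


def discretize_max_fraction(matrix: List[List[float]], fraction: float) -> List[List[int]]:
    """Encode over-expression as a boolean matrix (rows = genes, columns = samples).

    Assumed rule, not taken from the paper: a value is set to 1 when it is
    at least `fraction` times the maximum value observed for that gene.
    Expression values are assumed to be non-negative.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    boolean = []
    for row in matrix:
        row_max = max(row) if row else 0.0
        if row_max <= 0.0:
            # A gene that is never expressed has no over-expressed sample.
            boolean.append([0] * len(row))
            continue
        threshold = fraction * row_max
        boolean.append([1 if value >= threshold else 0 for value in row])
    return boolean


def density(boolean: List[List[int]]) -> float:
    """Proportion of 1 entries in the boolean matrix."""
    total = sum(len(row) for row in boolean)
    return sum(sum(row) for row in boolean) / total if total else 0.0


# Toy data invented for illustration: 3 genes measured in 4 samples.
raw = [
    [10.0, 2.0, 8.0, 1.0],
    [5.0, 5.0, 0.0, 4.0],
    [0.0, 3.0, 9.0, 9.0],
]

loose = discretize_max_fraction(raw, 0.7)
strict = discretize_max_fraction(raw, 0.9)
print(loose, density(loose))
print(strict, density(strict))
```

On this toy matrix, raising the parameter from 0.7 to 0.9 removes two of the seven true entries, which is the kind of sensitivity that motivates choosing the technique and its parameters per data set.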
