Mutual Information Clustering for Efficient Mining of Fuzzy Association Rules with Application to Gene Expression Data Analysis

Fuzzy association rules can reveal useful dependencies and interactions hidden in large gene expression data sets. However, deriving them poses very difficult combinatorial problems whose cost depends heavily on the size of these sets. This paper follows a divide-and-conquer approach that yields computationally manageable solutions. First, we cluster genes that are more likely to be associated. The fuzzy association rule extraction algorithms then face many problems, each of significantly reduced computational size, that can usually be processed very quickly. The clustering phase is based on mutual information (MI), which serves as the similarity measure. However, the numerical evaluation of MI is subtle; we experiment with the main estimation methods and compare them. To implement the mutual information clustering we use a SOM (Self-Organized Map) based approach that is capable of effectively incorporating supervised bias. After the mutual information clustering phase, the fuzzy association rules are extracted locally on a per-cluster basis. The paper presents an application of these techniques to mining gene expression data. However, the techniques can easily be adapted to the intelligent exploration of other similar data sets.
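
As an illustration of using mutual information as a similarity measure between two expression profiles, the following is a minimal sketch of a histogram (plug-in) estimator. It is only one of several possible estimators and is not necessarily the one used in the paper; the function name `mutual_information`, the `bins` parameter, and the synthetic profiles are assumptions made for this example. With finite samples this estimator is biased upward, which is one reason the numerical evaluation of MI is subtle.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Plug-in (histogram) estimate of MI, in nats, between two 1-D profiles.

    Hypothetical helper for illustration; `bins` is an assumed parameter.
    """
    # Joint histogram of the two profiles, normalised to a joint distribution.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    # Marginals obtained by summing the joint distribution over each axis.
    px = pxy.sum(axis=1, keepdims=True)   # shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # shape (1, bins)
    # Sum only over occupied cells to avoid log(0).
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


# Synthetic profiles (assumed data, not from the paper).
rng = np.random.default_rng(0)
gene_a = rng.normal(size=500)
gene_b = gene_a + 0.1 * rng.normal(size=500)   # strongly dependent on gene_a
gene_c = rng.normal(size=500)                  # independent of gene_a

mi_dependent = mutual_information(gene_a, gene_b)
mi_independent = mutual_information(gene_a, gene_c)
```

A pairwise matrix of such values could then serve as the similarity input to a clustering step; the dependent pair should score clearly higher than the independent pair, although the independent pair will not score exactly zero because of the finite-sample bias.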
