P3m - Possibilistic Multi-Step Maxmin and Merging Algorithm with Application to Gene Expression Data Mining

Gene expression data generated by DNA microarray experiments provide a vast resource for medical diagnosis and disease understanding. Unfortunately, the sheer volume of data makes it hard, and sometimes impossible, to characterize the behavior of genes correctly. In this work, we develop a possibilistic approach for mining gene microarray data. Our model consists of two steps. In the first step, we use possibilistic clustering to partition the data into groups (or clusters). The optimal number of clusters is determined automatically from the data, using the Partition Information Entropy as a validity measure. In the second step, we select the most representative genes from each computed cluster and model them as a graph called a proximity graph. This set of graphs (or hyper-graph) is then used to predict the function of new and previously unknown genes. Benchmark results on real-world data sets show that our model computes optimal partitions well even in the presence of noise and achieves high prediction accuracy on unknown genes.
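The first step can be illustrated with a minimal sketch. It is not the P3m algorithm itself: it assumes the standard possibilistic membership of Krishnapuram and Keller, u_ij = 1 / (1 + (d_ij^2 / eta_j)^(1/(m-1))), and uses a Shannon-style entropy over row-normalized memberships as a stand-in for the Partition Information Entropy, whose exact definition is not given in this abstract. The function names, the bandwidths `eta`, and the toy data are all hypothetical.

```python
import numpy as np

def possibilistic_memberships(X, centers, eta, m=2.0):
    """Possibilistic (typicality) memberships, shape (n_points, n_clusters).

    Assumed form: u_ij = 1 / (1 + (d_ij^2 / eta_j)^(1/(m-1))).
    Unlike fuzzy c-means, the rows are not forced to sum to one.
    """
    # Squared Euclidean distance from every point to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + (d2 / eta[None, :]) ** (1.0 / (m - 1.0)))

def partition_entropy(U, eps=1e-12):
    """Mean Shannon entropy of row-normalized memberships.

    Stand-in validity measure (an assumption): lower values indicate
    a crisper, less ambiguous partition.
    """
    P = U / (U.sum(axis=1, keepdims=True) + eps)
    return float(-(P * np.log(P + eps)).sum(axis=1).mean())

# Toy "expression profiles": two well-separated groups of two genes each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
eta = np.array([1.0, 1.0])  # hypothetical per-cluster bandwidths

U = possibilistic_memberships(X, centers, eta)
pe = partition_entropy(U)
```

In a model-selection loop, one would compute such a score for each candidate number of clusters and keep the partition with the best value; which direction counts as best depends on the validity measure actually used.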
